Ben Scofield

me. still on a blog.

On Ansible

In my post on flexible infrastructures, I mentioned in passing that I was managing my ops work with Ansible rather than the more traditional Chef or Puppet. Several factors guided me towards this choice:

  • The overarching goal for the new infrastructure was to have disposable servers in every role, instead of maintaining long-running servers over time. As a result, I focused on the initial provisioning much more than the ongoing configuration management experience.
  • I wanted the servers to be as similar as possible, but not more than they needed to be. If two distinct roles needed Ruby, I wanted them to use the same version.
  • I was (am) new to this level of operations involvement, so the quicker and easier the learning curve, the better.
  • Less importantly, I was experimenting with the idea of using Packer to create AMIs that we could launch on demand to fire up new servers. For this, I found local provisioning more intuitive than something centralized.

Starting from Chef

Our existing infrastructure was managed with a standard Chef setup; I spent a full week trying to wrap my head around what we had and adapt it to the new vision. I felt that I had to replace our existing cookbooks because they’d gotten far out of date, but when I pulled in community cookbooks for the software I wanted to install, I kept running into conflicts. One would require the redis cookbook, but another would need redisio; one would install Ruby 1.9.3, while another would use 2.0.0. Sure, I could’ve (and did, at first) fork them to bring them in line with each other, but then I’d just be setting us up to fall out of date again in the future.

Chef also fell short on the learning-curve principle; I felt lost from the start looking through the existing repo we had, and the documentation never quite seemed to answer my questions. I was never clear on when something applied to normal Chef vs. chef-solo, for instance – and all the examples I saw started with the full, relatively complicated hierarchical file structure that’s great when you know what’s going on but rough when you see it unexplained.

Finally, Chef just seemed overly powerful for what I wanted – it’s very obviously Configuration Management, when I just needed a little provisioning tool. This also kept me from digging into Puppet too deeply, especially once I ran across Ansible.

Finding Ansible

I was looking for simpler alternatives to Chef and saw a link to Ansible’s documentation. Within a few minutes, I knew how to do the local provisioning I wanted (by making 127.0.0.1 the only entry in the hosts file) and saw how to start, simply, with a single playbook file. YAML, as much as I hate it as a serialization format (seriously, I hate it for that with a fiery passion), seemed perfectly suited to directives like this:

- name: Generate the Nginx configuration file
  copy: src=nginx.conf
        dest=/etc/nginx/nginx.conf
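
For the curious, the local-provisioning inventory really is that small – a minimal sketch of the hosts file, where ansible_connection=local tells Ansible to skip SSH entirely:

127.0.0.1 ansible_connection=local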

As I grew more comfortable with how Ansible worked, I started looking at more complicated directory structures and setups using roles, but the key was the ease with which I moved into it – every step was easy and made sense at the time, as opposed to just being dropped in the deep end.
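
To give a flavor of where that progression ends up, here is a sketch of a role-based playbook – the host group and role names are hypothetical:

- hosts: webservers
  roles:
    - common
    - nginx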

The simplicity of the playbooks (and their direct correlation to the shell commands I’d run to set up the server manually) made it incredibly easy to write my own roles and reuse them for different server types, which made it trivial to keep the dependencies identical wherever possible.
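
As an example of that correlation, the task below is essentially apt-get install redis-server written down – a sketch in the same key=value style as the snippet above, with an illustrative package:

- name: Install Redis
  apt: pkg=redis-server state=present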

Wrapping up

I hope no one reads this and comes away thinking that I’m saying Ansible is objectively superior to Chef or Puppet. They’re all powerful tools – it’s just that I found Ansible to be the best fit for me, given my objectives and experience. Honestly, the more automation we can get in operations, the better, regardless of the tools used!

That said, if you’re looking to get started with all of this, I think Ansible is well worth a look.

On a Flexible Infrastructure

I’ve been having a lot of fun over the past couple of weeks shading into the ops side of things – I’ve been experimenting with an alternative to the existing infrastructure at my day job, and have been learning a ton. There’s a long way to go, but I think I’ve got the start of an interesting, flexible setup.

Configuration

At the heart of this cluster is a configuration server; I wanted to have new instances easily register themselves and discover existing instances and services. After looking at a few alternatives, I ended up going with etcd, in large part because of its curl-friendly interface (which also powers my experimental Ember.js web UI for etcd, wetcd).
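
That curl-friendliness is easy to show – a sketch against etcd’s early v1 HTTP API, with hypothetical keys and values:

$ curl -L http://127.0.0.1:4001/v1/keys/services/redis -d value=10.0.0.12   # set a key
$ curl -L http://127.0.0.1:4001/v1/keys/services/redis                      # read it back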

Monitoring

For monitoring, I wanted something that’d easily adjust to instances spinning up and down. Sensu looked like a good fit, so I’ve been giving that a go. It introduces dependencies that might scare some people off (Ruby on every client, for instance), but it’s incredibly flexible.
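
For the curious, Sensu checks are just small JSON definitions (dropped into, e.g., /etc/sensu/conf.d/) – a sketch, assuming the standard Nagios check_http plugin is installed on the client; the check name, subscribers, and interval are illustrative:

{
  "checks": {
    "nginx_up": {
      "command": "check_http -H localhost",
      "subscribers": ["webserver"],
      "interval": 60
    }
  }
}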

Analytics

The old standbys of StatsD and Graphite are great; the only outstanding question for me here is what dashboard to use, since Composer is … not the best.
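
Part of what makes StatsD such a standby is its dead-simple UDP protocol; you can increment a counter by hand (the metric name and hostname are placeholders):

$ echo "signups:1|c" | nc -u -w1 statsd.example.com 8125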

Logs

The piece that I think gets overlooked most frequently in infrastructure setups is log aggregation – when you’re adding and removing instances willy-nilly, you really need a solid, central place to view and analyze logs (especially if you want to keep people away from direct access to the servers). I’m loving Logstash for this, especially since it just added some great features as it hit 1.2. For getting the logs to Logstash, I’m relying on good ol’ rsyslog. Finally, I’m using Kibana 3 to view and analyze the logs.
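
The rsyslog side of that is pleasantly small – a sketch of a catch-all forwarding rule, assuming Logstash is listening with a matching syslog/TCP input (the host and port are assumptions):

# /etc/rsyslog.d/10-logstash.conf
*.* @@logstash.internal:5514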

In Action

So here’s what happens: when a new instance comes up, it first pings the configuration server to find out where monitoring, logging, and the like live, so those pieces can be configured correctly. Once provisioning is complete, the instance notifies the configuration server that it is available for whatever role it’s playing.
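
A sketch of that handshake, reusing etcd’s curl API from above (the addresses, paths, and values are hypothetical):

$ curl -L http://config.internal:4001/v1/keys/services/logging              # discover
$ curl -L http://config.internal:4001/v1/keys/roles/web/i-0123abcd \
       -d value=10.0.1.17                                                   # register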

All boxes get a standard set of monitoring checks by default, in addition to checks specific to whatever they’re running (nginx, Redis, etc.); some of these checks get forwarded on to the analytics server. Finally, all logs get shipped to the log server via rsyslog. It all works together shockingly well (from my naive, developer-oriented perspective, at least).

There’s more to talk about (for instance, why I prefer managing all of this with Ansible instead of Chef or Puppet), but I’d love to hear what you all think. Hit me up on Twitter with comments and questions!

On GitHub, DDOSs, and Deploys

When GitHub goes down, you can almost hear the wailing in the streets. GitHub has cemented itself as a central part of many development workflows, and while much of that is unaffected by the occasional DDOS, one element in particular has the potential to cause a lot of trouble: deploys.

Standard practice for many deployment tools (e.g., Capistrano) relies critically on the GitHub repository being available – the latest code is checked out either on the remote server itself or locally before being pushed to the remote box. If GitHub’s unavailable, all of that comes to a screeching halt.

Enter: deus_ex (github). It’s a simple RubyGem meant to work around this exact problem.
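
(deus_ex talks to AWS through fog, hence the credentials file below. If you haven’t used fog before, ~/.fog is plain YAML – a minimal sketch, with placeholder values:)

default:
  :aws_access_key_id: YOUR_ACCESS_KEY
  :aws_secret_access_key: YOUR_SECRET_KEY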

Say GitHub is being DDOSed and you need to deploy – just install the gem, ensure your AWS credentials are correct in ~/.fog, and:

$ deus_ex
[DEUS EX] connection established
[DEUS EX] creating server (this may take a couple of minutes)
[DEUS EX] server created
[DEUS EX] initializing git repo
[DEUS EX] git repo initialized
[DEUS EX] adding local git remote
[DEUS EX] pushing to remote
The authenticity of host 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com (xx.xx.xx.xx)' can't be established.
RSA key fingerprint is xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com,xx.xx.xx.xx' (RSA) to the list of known hosts.
Counting objects: 126, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (78/78), done.
Writing objects: 100% (126/126), 13.70 KiB, done.
Total 126 (delta 51), reused 104 (delta 42)
To ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:deus_ex_project.git
 * [new branch]      master -> master
[DEUS EX] removing local git remote
[DEUS EX]
[DEUS EX] you can now deploy from ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:deus_ex_project.git

Then, jump over to your deploy tool, set it to look at your new repository instead of GitHub, and deploy away!
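
With Capistrano 2, for instance, that’s a one-line change in config/deploy.rb – a sketch, pointed at the address deus_ex printed:

set :repository, "ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:deus_ex_project.git"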

Once you’re done, you’ll need to clean up the instance:

$ deus_ex cleanup
[DEUS EX] connection established
[DEUS EX] server destroyed

And be sure to set your deploy tool to look at GitHub again for future deploys!

On a schema.rb Mystery

Imagine, if you will, a database table with your average, everyday foreign key. It’s an INTEGER column.

Now, imagine a coworker opens a pull request that, among other changes, has a line in schema.rb changing that column to a VARCHAR. There’s no associated migration, and when she manually corrects her schema.rb (THIS IS A BAD PRACTICE AND YOU SHOULD NEVER DO IT), a simple db:migrate makes the offending line reappear.

What could be going on?


Stumped? So were we. After a lengthy investigation, however, we discovered the problem. The foreign key started as a VARCHAR long ago, which was probably just an oversight. The original developers updated the column type to an INTEGER outside the migration workflow, with the result that their schema.rb files showed it as an INTEGER. Whenever a new employee started and migrated the DB from scratch, however, they got the VARCHAR version – so any schema.rb they generated would try to reset the column to a VARCHAR.

The end solution? Add a migration to properly set the datatype of the field, and everybody’s happy.
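
For completeness, the fix is a plain change_column migration – a sketch, with hypothetical table and column names:

class FixWidgetIdColumnType < ActiveRecord::Migration
  def up
    change_column :orders, :widget_id, :integer
  end

  def down
    change_column :orders, :widget_id, :string
  end
end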

On the Conference I’d Like to See

I’m in Sweden at the moment, taking part in an amazing Nordic Ruby conference. Yesterday, I gave a talk called Better: A Field Guide to Continuous Improvement, which was about topics that have been much on my mind over the past few years. Several of the other talks here have dealt – directly or indirectly – with the same topics, and being here has just reinforced how much I’d love to see a specific sort of conference.

In my head, I’m calling this conference BETTER. It would be devoted to the idea of improvement, of getting better, but not solely for professional expertise. The talks would each fall into specific slots on two metrics. First, they’d be divided up by domain:

  • improving professionally (skills),
  • improving socially (moral and ethical), and
  • improving personally (health, mood, welfare)

Then, for each domain, there would be three sorts of talks:

  • research,
  • technology, and
  • practice

Altogether, that makes for nine slots – and I can imagine talks I’d love to see in every one of them.

So, who’s in? Should I kickstart this à la XOXO? (Or: I wouldn’t be upset if someone took this idea and ran with it.)

On What I Want to Do

We just wrapped the 2013 edition of RailsConf, and I’m both exhausted and excited. There’s nothing quite like being surrounded by 1500 of your peers, all sharing knowledge, experimenting, and having fun for a few days.

Every year at RailsConf, we have a job board – and every year, it fills up on the first day. Hundreds of hands shot up when we asked who was hiring, and being around that many people thinking about jobs and recruiting and whatnot meant that I got to explain why I’m funemployed (and what would cause me to leave it) several times in the last few days.

When asked that over the past few weeks, I’ve been telling people basically the same thing, so I thought it’d make sense to set it out here. This, then, is what I want to do:

I’m fascinated with feedback as the primary mechanism to improvement. I love the research on the development of expert performance, devices that measure and report on your activity, and experiments that show how our behavior is shaped by the way people and the world respond to us. I’m intrigued by self-tracking, to the extent that when my Nike Fuelband stopped working I bought a Jawbone UP to get me through the couple of days it was off for replacement.

The Fuelband, UP, and other devices represent to me the culmination of a march of progress (that I’ve referred to before). For any given domain,

  1. You start with no tracking
  2. Then, you start tracking – but it’s intermittent and subjective (I ran today)
  3. Next, you start to track events when they happen (keeping a running log in your car)
  4. After that, you start to add technology so that your recordings are more objective (I ran 3.2113 miles in 29:42 – thanks, GPS!)
  5. Once you’ve got technology, you can move to automated recordings (automated tweets of your progress)
  6. And finally, when the tech is small, light, and low-powered enough you can keep it on all day long and measure all activity, not just designated runs

This process describes a continuum from a complete lack of tracking, through sporadic, subjective, imprecise recordings, all the way to objective, continuous, ubiquitous tracking. That’s what I’m interested in – applying that process to different domains, specifically so that people can then look at the data, understand what they’re actually doing every day, and make changes for the better.

These efforts do exist, but for the most part they’ve only advanced in the health field, and more specifically in the general-physical-activity field. Fitbit, Nike Fuelbands, and Jawbone UPs are great, but I see an enormous amount of potential for this same process to take place in other aspects of fitness (for instance, weight training), reading and publishing, software development, and more.

So: if you’re working on something like this, let me know! I’d love to chat, even if I’m not an exact match for what you’re doing.

On Fairness and Developer Salaries

I’m taking advantage of being funemployed by taking a few online courses, including Dan Ariely’s A Beginner’s Guide to Irrational Behavior at Coursera. I’m very familiar with Ariely’s work (I’ve read each of his books and cited various pieces of his research in my talk at SXSW a few years ago), but I’ve been pleasantly surprised as each week has taught me something new.

Case in point: fairness, specifically as it relates to salaries. It’s rare to see developers paid in multiples – even if the 10x developer is a myth, it’s certainly true that some devs are several times more valuable to a company than others. Even in those cases, however, it’s exceedingly rare to see salaries that vary by the same magnitude as the value the worker brings.

As it turns out, research implies that “fairness” as a concept depends more on effort than on results. Imagine how much you’d be willing to pay two people to accomplish the same task (say, fixing your car). The first person takes eight hours and is clearly struggling the whole time, while the second person barely breaks a sweat and fixes it in fifteen minutes. Most people are inclined to pay the first person more, despite the fact that the outcomes are the same – and that the second person imposed much less of an opportunity cost on you (since you’ll have an extra seven and three-quarter hours to drive around in your car, time that would’ve been lost in the first scenario).

So there’s the problem: concerns of fairness predispose us against paying people according to their actual value, instead moving us towards paying based on the effort they exert. Thus, we probably overpay inexperienced developers (who have to exert more effort to produce a given amount of value) and massively underpay our best developers (who create much more value more easily).

Note that hourly consulting rates codify this intuition – if a task takes longer, you pay more – and that’s only partially offset by different rates. Flat fees for a project (based on the projected value the project will bring) are the clear alternative, but there are known issues with those as well.

On My Productivity

A couple of weeks ago, I posted about my upcoming funemployment (which has since begun). I mentioned in that post that my past jobs have helped me figure out how I’m most productive, and I’ve had a couple of people ask me what exactly I’ve learned about that. So, here goes.

I don’t think I’m particularly unique with any (or at least most) of these items, but I’ve found it helpful to be clear about them to myself.

Timing

Like everybody else, I prefer fairly lengthy stretches of uninterrupted time. Being easily distracted, I find it easiest to get those stretches early in the morning or late at night – I’ve been known to wake up at 3am when I really need to churn on something to hit a deadline.

Location

I’ve built a Cave for myself in my home office, and I love it. I’ve got a bright orange wall, six huge bookshelves bursting with evidence that I loved reading before I got an e-reader, boxes of comics, a couch, a whiteboard currently hosting a 3,000-piece jigsaw puzzle in progress, and everything else I might need to relax and get amazing amounts of stuff done.

My Cave also serves as our guest room, and it’s when other people are in there that I really understand how essential that room is to me. I’ve lived in the house with my Cave inaccessible for over a week, and I nearly pulled my hair out. If I’m close to the Cave but can’t use it, I’m essentially useless.

That said, if I’m far from the Cave – at an actual office, at a conference, or somewhere similar – I can cobble together places to work where I can still be productive. The key for me is being free to move; instead of the comforting sameness of the Cave, I need variety. Coffee shop, hotel desk, hack room, comfy bench – I can work at any of them as long as I can move to another place easily. (This is why I get almost nothing done when visiting family; I can work for a bit at my in-laws’ house, but it’s rude to up and leave for another venue after an hour.)

That need for freedom also plays into my opinions on on-site vs. remote work. I can be tremendously productive on-site, up to a point. For my best work, however, I need to get out of the office at times and just focus full-bore on the problem. Similarly, I’ve done a ton as a remote employee, but I do need to get into the office periodically to check in and reset relationships with co-workers.

People

I’ve made great friends at every one of my past jobs, and I much prefer working with friends to mere acquaintances. Friendship makes communication easier, which has made otherwise difficult remote positions (which are always constrained by communication issues) very successful.

I also love working with people who are smarter than me. Luckily, that’s not so hard to find; the challenge is making sure I take advantage of it. I end up being more productive when I force myself to talk things over with colleagues.

Products

For much of my career, I didn’t care that much about the specific product I worked on – I was much more focused on enjoying the technical aspects of the work, instead of the effects of what I was building. I can still do that and be productive, but I’m unable to maintain that indefinitely. In fact, the length of time I can “look past” what I’m building to enjoy the process is shortening – it’s probably no more than a couple of months now.

Of course, with increasing time spent on a project, you have the potential to become more productive (as your knowledge of the domain increases, etc.). Thus, I’m going to end up being more productive when I’m in love with the project and can see myself spending a long time working on it.

Ownership

The other problem I had with consulting was one of ownership: even when I was passionate about the project, as a consultant I was largely subject to the whims of the client. Regardless of the strength (or rightness) of my opinions, I could always be overruled by the person writing the checks. I’ve since had the opportunity to work on projects where I had much more control and ownership, and I think that I’ve been much more productive as a result.

Creativity

The last thing I learned is that I have to build things. I can survive for a while without writing code, but if that continues indefinitely then I eventually become useless for almost any task.

So that’s about it, for now. At the moment, I’m trying to look at these factors and figure out how they should influence my search for what’s next. If you’ve got any suggestions, please let me know!

On “Monitoring”

I’m at Monitorama this week; it’s been a great conference, but a weird one for me. This is the first conference I’ve been to in years where I don’t know a significant minority of the attendees, and it’s the first non-Ruby/Rails conference I’ve been to in even longer. I’m enjoying the feeling of not-quite-knowing what’s going on, since I’m not deeply embedded in the DevOps / monitoring movements.

One thing that struck me yesterday during the talks was an issue of vocabulary: many of the speakers seemed to use “monitoring” and “alerting” almost interchangeably – it’s almost as if the purpose of monitoring was just to enable alerting, which is all that matters. (I don’t think that anyone actually holds that opinion, but that’s the way it came across at times).

Later in the day, I had a chat with Mark Imbriaco about just this, and he pointed out that there’s a third term that we need to care about as well: trending.

So, here’s my naïve attempt to clarify the definitions involved here:

Monitoring is the process of gathering data. It provides the foundation for both alerting and trending, but on its own just fills up hard drives and makes pretty graphs.

Alerting is the process of detecting anomalies in monitored data and announcing them to interested parties. This is what most of the DevOps movement appears to care about at the moment, because 1) alerts are what wake people up at 1am when the server’s on fire, 2) alerts are by definition exceptional and require a response (even if that response is “meh, it’ll clear”), and 3) current alerting technology is woefully inadequate, lacking context and even basic intelligence in many cases. Alerts inspire reactive action.

Trending is the process of looking at monitored data for patterns. This is the concept that I think is underemphasized in many current discussions, because alerting is so top-of-mind for everyone, but trending has one huge benefit: it allows you to be proactive. Looking at disk space usage trends may allow you to find and fix a log rotation problem days before it generates a wee-hours alert. Watching page load times may help you optimize code and generate an immediate bump in the number of people who complete a registration process.
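
As a concrete (if hypothetical) illustration with the tools above: Graphite’s render API can overlay a metric against itself from a week earlier, which makes slow growth obvious long before any threshold fires – the server, host, and metric names here are placeholders:

$ curl 'http://graphite.example.com/render?target=server01.disk.root.used&target=timeShift(server01.disk.root.used,"7d")'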

I’d love to see trending attacked with the same focus that alerting is currently getting.

(Of course, it might be, and I’m just too far on the outside to see it. If so, I’ll happily accept pointers to that work.)

On My Recent Brush With Rhabdo

First off: thank you to everyone who sent their thoughts and well-wishes. It was extremely heartening to open up Twitter or Facebook and see people hoping that I’d be OK.

OK, so a bit more information on my ill-timed hospital stay. After a hard (but not unreasonably so) Crossfit workout on Tuesday and being sick with a fever on Wednesday and Thursday, I went to the doctor on Friday to get checked out. As it turned out, the levels of creatine kinase in my blood were slightly elevated – normal is 20-300, whereas mine were over 60k (we don’t know the exact level I had because the lab’s scale only went to 60k). I got the call and went to the ER that night, missing the last few hours of Morgan’s birthday.

Extremely high CK levels are the calling card of rhabdomyolysis, which is the result of severe muscle damage. There are a lot of potential causes for rhabdo, including crush injuries, burns, a number of viral infections, and overexertion, among others. Basically, the damaged muscle cells spew their contents into the blood. This can be bad for the kidneys (which can’t filter muscle proteins and can fail as a result) and other organs (as various chemical balances can get thrown for a loop).

In many cases, rhabdo itself is untreatable – the muscles are damaged and you can’t undamage them. (Some of the potential causes result in ongoing muscle damage, but those are very rare.) What you can do is treat to prevent the secondary problems – the kidney damage and so on. So, once I was in the ER (and for the duration of my stay at the hospital), I was pumped full of IV fluids to keep the kidneys from getting damaged by the muscle proteins in my blood. Luckily, my blood tests showed no evidence of any kidney or other organ damage; all I had were the high CK levels. The hospital was able to be more specific about those levels, though, which was good. On admission, I was at 65k (200+ times the normal level).

The doctors pretty rapidly agreed that the workout on Tuesday was not the sole cause of the problem, given my description of it and the next few days. What they weren’t able to do, however, was settle on the other contributing causes. I might have been predisposed to rhabdo by a viral infection (which is very difficult to detect and basically untreatable anyway), by the flu (which takes a while to detect and was also untreatable by the time they were looking for it), by electrolyte imbalances (impossible to detect after the muscle damage, because the rhabdo itself hides the original issue), or by something else entirely. One doctor told me directly that I was a “confusing case,” and several expressed frustration that they weren’t able to narrow it down more fully.

That said, by this morning my CK levels had dropped into the low 30k range, indicating that the muscle damage wasn’t ongoing. Given that and the excellent condition of my kidneys and other organs, the doctors determined that it was safe to send me home (with the proviso that I drink a ton of fluids and avoid intense exercise for a couple of weeks). I’m to follow up with my primary care doctor for bloodwork on Tuesday and again the following week to confirm that my levels continue to drop – and potentially to see if an electrolyte imbalance or anything else becomes visible as the rhabdo itself recedes.

So, there you have it: my rhabdo journey. I had a mild-to-moderate case, with no complications and almost no visible symptoms, so I count myself lucky despite the lack of a real explanation about how I ended up with it.

Will I be going back to Crossfit? Absolutely. It seems pretty clear from talking with the doctors that the workout itself wasn’t enough to cause this, and the best way to guard against exertion-caused rhabdo in the future is to continue to improve my overall fitness. I love the people and the supportive atmosphere at Crossfit 919, and I can’t wait to get back there once I’m cleared to exercise again.

Will I be more careful about how hard I push myself and how I eat and drink when I’m sick? Definitely. Even if the respiratory crud I’ve had for the last several weeks or an electrolyte imbalance didn’t contribute to this, it’s a wakeup call that those things are even more important when you’re ill.

Would I eat the grilled chicken caesar salad at this particular hospital again? Nope. The chicken was heavily spiced with cumin, giving the whole dish an unwelcome Tex-mex taste that didn’t go well with the dressing.