Cloud Foundry (CF) is a platform-as-a-service that, once deployed, makes it easy for developers to deploy, run and scale web applications. Powering this elegant PAAS is a complex distributed system comprised of several interoperating components: the Cloud Controller (CC) accepts user input and directs Droplet Execution Agents (DEAs) to stage and run web applications. Meanwhile, the Router maps inbound traffic to web-app instances, while the Loggregator streams log output back to developers. All these components communicate via NATS, a performant message bus.
It’s possible to boil CF down to a relatively simple mental model: users inform CC that they desire applications and Cloud Foundry ensures that those applications, and only those applications, are actually running on DEAs. In an ideal world, these two states – the desired state and the actual state – match up.
Of course, distributed systems are not ideal worlds and we can – at best – ensure that the desired and actual states come into eventual agreement. Ensuring this eventual consistency is the job of yet another CF component: the Health Manager (HM).
In terms of our simple mental model, HM’s job is easy to express:
- collect the desired state of the world (from the CC via HTTP)
- collect the actual state (from the DEAs via application heartbeats over NATS)
- perform a set diff to find discrepancies – e.g. missing apps or extra (rogue) apps
- send START and STOP messages to resolve these discrepancies
Despite the relative simplicity of its domain, the Health Manager has been a source of pain and uncertainty in our production environment. In particular, there is only one instance of HM running. Single points of failure are A Bad Thing in distributed systems!
The decision was made to rewrite HM. This was a strategic choice that went beyond HM – it was a learning opportunity. We wanted to learn how to rewrite CF components in place, we wanted to try out new technologies (in particular: Golang and etcd), and we wanted to explore different patterns in distributed computing.
HM9000 is the result of this effort and we’re happy to announce that it is ready for prime-time: For the past three months, HM9000 has been managing the health of Pivotal’s production AWS deployment at run.pivotal.io!
As part of HM9000’s launch we’d like to share some of the lessons we’ve learned along the way. These lessons have informed HM9000’s design and will continue to inform our efforts as we modernize other CF components.
Laying the Groundwork for a Successful Rewrite
Replacing a component within a complex distributed system must be done with care. Our goal was to make HM9000 a drop-in replacement by providing the same external interface as HM. We began the rewrite effort by steeping ourselves in the legacy HM codebase. This allowed us to clearly articulate the responsibilities of HM and generate tracker stories for the rewrite.
Our analysis of HM also guided our design decisions for HM9000. HM is written in Ruby and contains a healthy mix of intrinsic and incidental complexity. Picking apart the incidental complexity and identifying the true – intrinsic complexity – of HM’s problem domain helped us understand what problems were difficult and what pitfalls to avoid. An example might best elucidate how this played out:
While HM’s responsibilities are relatively easy to describe (see the list enumerated above) there are all sorts of insidious race conditions hiding in plain-sight. For example, it takes time for a desired app to actually start running. If HM updates its knowledge of the desired state before the app has started running it will see a missing app and attempt to START it. This is a mistake that can result in duplicates of the app running; so HM must – instead – wait for a period of time before sending the START message. If the app starts running in the intervening time, the pending START message must be invalidated.
These sorts of race conditions are particularly thorny: they are hard to test (as they involve timing and ordering) and they are hard to reason about. Looking at the HM codebase, we also saw that they had the habit – if not correctly abstracted – of leaking all over and complicating the codebase.
Many Small Things
To avoid this sort of cross-cutting complexification we decided to design HM9000 around a simple principle: build lots of simple components that each do One Thing Well.
HM9000 is comprised of 8 distinct components. Each of these components is a separate Go package that runs as a separate process on the HM9000 box. This separation of concerns forces us to have clean interfaces between the components and to think carefully about how information flows from one concern to the next.
For example, one component is responsible for periodically fetching the desired state from the CC. Another component listens for app heartbeats over NATS and maintains the actual state of the world. All the complexity around maintaining and trusting our knowledge of the world resides only in these components. Another component, called the analyzer, is responsible for analyzing the state of the world — this is the component that performs the set-diff and makes decisions. Yet another component, called the sender, is responsible for taking these decisions and sending START and STOP messages.
Each of these components has a clearly defined domain and is separately unit tested. HM9000 has an integration suite responsible for verifying that the components interoperate correctly.
This separation of components also forces us to think in terms of distributed-systems: each mini-component must be robust to the failure of any other mini-component, and the responsibilities of a given mini-component can be moved to a different CF component if need be. For example, the component that fetches desired state could, in principle, move into the CC itself.
Communicating Across Components
But how to communicate and coordinate between components? CF’s answer to this problem is to use a message bus, but there are problems with this approach: what if a message is dropped on the floor? How do you represent the state of the world at a given moment in time with messages?
With HM9000 we decided to explore a different approach: one that might lay the groundwork for a future direction for CF’s inter-component communication problem. We decided, after extensive benchmarking, to use etcd a high-availability hierarchical key-value store written in Go.
The idea is simple, rather than communicate via messages the HM9000 components coordinate on data in etcd. The component that fetches desired state simply updates the desired state in the store. The component that listens for app heartbeats simply updates the actual state in the store. The analyzer performs the set diff by querying the actual state and desired state and placing decisions in the store. The sender sends START and STOP messages by simply acting on these decisions.
To ensure that HM9000’s various components use the same schema, we built a separate ORM-like library on top of the store. This allows the components to speak in terms of semantic models and abstracts away the details of the persistence layer. In a sense, this library forms the API by which components communicate. Having this separation was crucial – it helped us DRY up our code, and gave us one point of entry to change and version the persisted schema.
Time for Data
The power behind this data-centered approach is that it takes a time-domain problem (listening for heartbeats, polling for desired state, reacting to changes) and turns it into a data problem: instead of thinking in terms of responding to an app’s historical timeline it becomes possible to think in terms of operating on data sets of different configurations. Since keeping track of an app’s historical timeline was the root of much of the complexity of the original HM, this data-centered approach proved to be a great simplification for HM9000.
Take our earlier example of a desired app that isn’t running yet: HM9000 has a simple mechanism for handling this problem. If the analyzer sees that an app is desired but not running it places a decision in the store to send a START command. Since the analyzer has no way of knowing if the app will eventually start or not, this decision is marked to be sent after a grace period. That is the extent of the analyzer’s responsibilities.
When the grace period elapses, the sender attempts to send a START command; but it first checks to ensure the command is still valid (that the missing app is still desired and still missing). The sender does not need to know why the START decision was made, only whether or not it is currently valid. That is the extent of the sender’s responsibilities.
This new mental model – of timestamped decisions that are verified by the sender – makes the problem domain easier to reason about, and the codebase cleaner. Moreover, it becomes much easier to unit and integration test the behavior of the system as the correct state can be set-up in the store and then evaluated.
There are several added benefits to a data-centered approach. First, it makes debugging the mini-distributed system that is HM9000 much easier: to understand what the state the world is all you need to do is dump the database and analyze its output. Second, requiring that coordination be done via a data store allows us to build components that have no in-memory knowledge of the world. If a given component fails, another copy of the component can come up and take over its job — everything it needs to do its work is already in the store.
This latter point makes it easy to ensure that HM9000 is not a single point of a failure. We simply spin up two HM9000s (what amounts to 16 mini-components, 8 for each HM9000) across two availability zones. Each component then vies for a lock in etcd and maintains the lock as it does its work. Should the component fail, the lock is released and its doppelgänger can pick up where it left off.
We evaluated a number of persistent stores (etcd, ZooKeeper, Cassandra). We found etcd’s API to be a good fit for HM9000’s problem space and etcd’s Go driver had fewer issues than the Go drivers for the other databases.
In terms of performance, etcd’s performance characteristics were close to ZooKeeper’s. Early versions of HM9000 were naive about the amount of writes that etcd could handle and we found it necessary to be judicious about reducing the write load. Since all our components communicated with etcd via a single ORM-like library, adding these optimizations after the fact proved to be easy.
We’ve run benchmarks against HM9000 in its current optimized form and find that it can sustain a load of 15,000-30,000 apps. This is sufficient for all Cloud Foundry installations that we know of, and we are ready to shard across multiple etcd clusters if etcd proves to be a bottleneck.
In closing, some final thoughts on the decision to switch to Golang. There is nothing intrinsic to HM9000 that screams Go: there isn’t a lot of concurrency, operating/start-up speed isn’t a huge issue, and there isn’t a huge need for the sort of systems-level manipulation that the Go standard library is so good at.
Nonetheless, we really enjoyed writing HM9000 in Go over Ruby. Ruby lacks a compile-step – which gives the language an enviable flexibility but comes at the cost of limited static analysis. Go’s (blazingly fast) compile-step alone is well worth the switch (large scale refactors are a breeze!) and writing in Go is expressive, productive, and developer friendly. Some might complain about verbosity and the amount of boiler-plate but we believe that Go strikes this balance well. In particular, it’s hard to be as terse and obscure in Go as one can be in Ruby; and it’s much harder to metaprogram your way into trouble with Go ;)
We did, of course, experience a few problems with Go. The dependency-management issue is a well-known one — HM9000 solves this by vendoring dependencies with submodules. And some Go database drivers proved to be inadequate (we investigated using ZooKeeper and Cassandra in addition to etcd and had problems with Go driver support). Also, the lack of generics was challenging at first but we quickly found patterns around that.
Finally, testing in Go was a joy. HM9000 became something of a proving-ground for Ginkgo & Gomega and we’re happy to say that both projects have benefited mutually from the experience.