Monday, November 14, 2016

How to run Continuous Integration on EC2 without breaking the bank

I live in a world increasingly made up of services (sometimes micro-) that collaborate to produce some visible behavior. This decomposition of software into (hopefully) loosely coupled components has implications for production infrastructure concerns such as how many servers or containers you need to run.
Consequently, it also impacts the testing infrastructure, which tries to be an environment as close as possible to production, in order to reliably discover bugs and replicate them in a controlled setting without impacting real users.
In particular I like to have a CI testing environment where each service can be tested in its own virtual machine (or container for some), with each node totally isolated from the other services. In addition to that, I also like to have an end-to-end testing environment where services talk to each other as they would in production, and where we can run long and complex acceptance tests.
This end-to-end environment is usually a perfect copy of production with respect to the technologies being used (e.g. load balancers like HAProxy or AWS ELB are in place just as in production, even if no test directly targets their existence); the number of nodes per service is however reduced from N to 2, since in computing the only equivalence categories are 0, 1 and N.

In the likely case that you're using cloud computing infrastructure to manage this 2x or 3x volume of servers with respect to the production infrastructure, your costs are also by default going to double or triple. One option to optimize this is to throw away everything and start deploying containers, as they could share the same underlying virtual machines as production while preserving isolation and reproducibility. On an existing architecture made up of AWS EC2 nodes, however, optimization can take us far without requiring a rewrite of all the DevOps(TM) work of the last two years.

Phase 1: expand

As I've been explaining, EC2 instances replicating production environments can expand until they bring the total number of EC2 nodes to three times the original number. Some projects just have a single EC2 node, while others have multiple nodes, which have to be at least 2 in the last testing environment before production. Moreover, the time the tests take to run on these instances is inversely correlated with how powerful they are in CPU and I/O terms, so you pay good money for every speed improvement you want to get on those 20- or 60-minute suites.
In my current role at eLife, we initially got to more than 20 EC2 instances for the testing environments. This was beneficial from a quality and correctness point of view, as we could then run tests not only on all mainline branches before they go to production, but also on pull requests, giving timely feedback on the proposed changes without requiring developers to run everything on their machines (that should be an option, not an imperative).

Phase 2: optimize

The AWS EC2 pricing model works by putting nodes into a running state when you launch them, and by pre-billing one hour of usage every 60 minutes. Therefore, booting existing instances or creating new ones from scratch is going to incur at least a 1-hour cost for each of these events:
  • at boot, 1 hour is billed;
  • at 60:00 from boot, a new hour is billed;
  • at 120:00 from boot, a new hour is billed; and so on.
All the EC2 nodes that we have, however, use EBS disks for their root volumes. EBS is the remote block storage provided by AWS, and while some generations of instances use local instance storage for their / partition, EBS makes a lot of sense for that partition as it gives you the ability to start and stop instances without losing state in between; essentially, you can shut down and reboot instances when you need to without paying EC2 bills for the time in which they are stopped. The only billing being performed is for the EBS storage space, which means AWS has some hard disks in its data centers that have to preserve your instance's files, but it allocates new CPU and RAM resources as a virtual machine only when you start the EC2 instance. Therefore, an EBS-backed EC2 instance on your AWS console does not always correspond to a physical place: it's really virtual, as it can be stopped and started many times, moving between racks in the same availability zone while keeping the same data (persisted to multiple disks in other racks) and even the same private IP address.
Since the process of allocating a new virtual machine and reconfiguring networks to connect everything together has some overhead, this is reflected in the boot time necessary to start an EC2 instance, which can add several seconds on top of the standard boot time for the operating system. It is also reflected in the pricing model, which makes it a bad idea to launch a new EC2 instance for every test suite you need to run: as long as your test suite takes less than 1 hour, you are already paying for a full hour of resources that you would be throwing away. Running 6 builds in an hour, each on a freshly launched instance, would make you pay for 6 hours, which is not what you want.
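To make the arithmetic explicit, here is a minimal sketch in plain Python (the build durations are made-up numbers, just for illustration):

    import math

    def billed_hours(uptime_minutes):
        # under per-hour billing, every started hour is charged in full
        return math.ceil(uptime_minutes / 60)

    # six 10-minute builds, each on a freshly launched instance: 6 hours billed
    print(sum(billed_hours(10) for _ in range(6)))

    # the same six builds queued on one instance that stays up: 1 hour billed
    print(billed_hours(6 * 10))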

Phase 2.1: stop them manually

A first optimization that can be performed manually is to stop and start these instances from the console. You would usually stop them at some hour of the day or evening, and then start them again first thing in the morning.
Of course, there is good potential for automating this, as AWS provides many ways to access EC2 with its API and the SDKs that build on top of it, available for many different programming languages. You can easily build some commands that query the EC2 API looking for an instance based on its tags, and then issue commands for starting and stopping it. In general, this is almost transparent to the CloudFormation templates that you are surely using for launching these instances.
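As an illustration, a small boto3 sketch along these lines (the region, the Name tag and the instance name are assumptions, not the actual setup):

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')  # assumed region

    def find_instance_id(name):
        # look up an instance by its Name tag
        result = ec2.describe_instances(
            Filters=[{'Name': 'tag:Name', 'Values': [name]}]
        )
        return result['Reservations'][0]['Instances'][0]['InstanceId']

    def start(name):
        ec2.start_instances(InstanceIds=[find_instance_id(name)])

    def stop(name):
        ec2.stop_instances(InstanceIds=[find_instance_id(name)])

    start('project--ci')  # hypothetical instance name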
The first time you start and stop an instance, there are a few problems that may come up.
The first problem is that of ephemeral storage: as I wrote before, you have to make sure the root volume of the instance and any data you want to persist are EBS-backed and not on local instance storage.
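If in doubt, the EC2 API tells you whether an instance is EBS-backed; a boto3 sketch (the instance id is a placeholder):

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')

    def is_ebs_backed(instance_id):
        # only EBS-backed instances can be stopped and started again;
        # instance-store-backed ones can only be terminated
        result = ec2.describe_instances(InstanceIds=[instance_id])
        instance = result['Reservations'][0]['Instances'][0]
        return instance['RootDeviceType'] == 'ebs'

    print(is_ebs_backed('i-0123456789abcdef0'))  # placeholder instance id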
The second problem is that of public IP addresses. While private IP addresses inside a VPC stay the same across a stop and a start, public IP addresses are a scarce resource and are only allocated to the instance while it is running. Therefore, if you had a DNS record pointing to it, it has to be updated after the boot, whether it was created manually or as part of the CloudFormation template. Default DNS entries have the form ec2-public-ip-address.compute-1.amazonaws.com, which depends on the public IP address and hence does not provide a good indirection.
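If the record lives in Route 53, re-pointing it after each start can be scripted as well; a sketch assuming a hypothetical hosted zone and record name:

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')
    route53 = boto3.client('route53')

    def repoint_dns(instance_id, zone_id, record_name):
        # the new public IP is only known once the instance is running again
        result = ec2.describe_instances(InstanceIds=[instance_id])
        public_ip = result['Reservations'][0]['Instances'][0]['PublicIpAddress']
        route53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={'Changes': [{
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': record_name,
                    'Type': 'A',
                    'TTL': 60,
                    'ResourceRecords': [{'Value': public_ip}],
                },
            }]},
        )

    # all three arguments are placeholders
    repoint_dns('i-0123456789abcdef0', 'Z1EXAMPLE', 'project--ci.example.com')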
The third problem is that of long-running processes managed by SysV/Upstart/systemd: the daemons of servers like Apache, Nginx or MySQL are usually configured to restart upon boot, but if you have written your own daemons or Python/PHP long-running processes and are starting them through /etc/init or /etc/init.d configuration, it pays to check everything is in its place again after boot.
The last problem I have found at this level (manual restarts) is about files in /run and /var/run, which are temporary directories used by daemons to place locks and other transient files like a pidfile indicating an instance of that program is running. If you have folders in /run or /var/run, those folders will have to be recreated. systemd provides the tmpfiles.d mechanism, which automatically creates a hierarchy of files, but it's usually just easier (and more portable) to have the daemons create their own folders (php-fpm does that) or, if they are not able to do that, to place their files not in /var/run/some_folder_that_will_stop_existing but directly in /var/run or even /tmp, without subfolders.
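For your own daemons, the fix can be as small as recreating the directory at startup; a minimal sketch (the path is just an example):

    import os

    RUN_DIR = '/var/run/my-daemon'  # example path; it will not survive a stop/start

    def ensure_run_dir():
        # recreate the directory and the pidfile at every boot,
        # instead of assuming a previous run left them in place
        os.makedirs(RUN_DIR, exist_ok=True)
        with open(os.path.join(RUN_DIR, 'pid'), 'w') as pidfile:
            pidfile.write(str(os.getpid()))

    ensure_run_dir()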

Phase 2.2: start them on demand

Instead of manually starting EC2 instances, or automating their stopping and starting as a periodic task, you can also start them on demand, as needed by the various builds that have to run. So whenever project x needs to build a new commit on master or a pull request, you will start the x--ci EC2 instance.
In this case, however, there is a larger potential for race conditions, as you may try to run a deploy or any other command on an instance before it's actually ready to be used. Therefore, we wrote some automation code that waits for several events before letting a build proceed:
  1. the instance must have gone from the pending to the running state on the EC2 API. This hopefully means AWS has found a CPU and other resources to assign to it.
  2. the instance must be accessible through SSH.
  3. through SSH, we monitor that the file /var/lib/cloud/instance/boot-finished has appeared. This file appears at each boot, once all daemons have been started, as part of the standard cloud-init package.
Once the instance has gone through all these phases, you can SSH into it and run whatever you want.
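A condensed sketch of these three checks, using a boto3 waiter and plain ssh (the region, instance id and host are placeholders):

    import subprocess
    import time

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')

    def wait_until_ready(instance_id, host, timeout=300):
        # 1. pending -> running on the EC2 API
        ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])
        deadline = time.time() + timeout
        while time.time() < deadline:
            # 2. SSH accepts connections, and
            # 3. cloud-init has finished starting all daemons
            check = subprocess.run(
                ['ssh', '-o', 'ConnectTimeout=5', host,
                 'test -f /var/lib/cloud/instance/boot-finished'])
            if check.returncode == 0:
                return
            time.sleep(5)
        raise RuntimeError('%s did not become ready in time' % host)

    wait_until_ready('i-0123456789abcdef0', 'ubuntu@project--ci.example.com')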

Phase 2.3: stop them when it's more efficient

Once you have transitioned to starting instances at the last responsible moment, you can do the same for stopping them, instead of just waiting for the end of the day to shut everything down.
We now have a periodic job, running every 2 minutes, that takes a list of servers to consider for stopping. In parallel, it performs the following routine for each of the EC2 instances (a sketch in code follows the list):
  • checks whether the server has been running for an amount of time between h:55:00 and h:59:59, where h is some whole number of hours;
  • if the condition is true, stops the instance before we incur a new billed hour;
  • otherwise, leaves the instance running: you have already paid for this hour, so it does no harm to do so, and the instance can be used to run new builds at no extra cost.
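A boto3 sketch of that routine (the instance id is a placeholder; LaunchTime is reset at every start, so it marks the beginning of the current billing period):

    from datetime import datetime, timezone

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')

    def stop_if_close_to_billed_hour(instance_id):
        result = ec2.describe_instances(InstanceIds=[instance_id])
        instance = result['Reservations'][0]['Instances'][0]
        if instance['State']['Name'] != 'running':
            return
        uptime = datetime.now(timezone.utc) - instance['LaunchTime']
        minutes_into_hour = (uptime.total_seconds() // 60) % 60
        # stop only in the last 5 minutes of an hour that is already paid for
        if 55 <= minutes_into_hour <= 59:
            ec2.stop_instances(InstanceIds=[instance_id])

    stop_if_close_to_billed_hour('i-0123456789abcdef0')  # placeholder id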
Therefore, when developers open a dozen pull requests on the same project, only the first one starts the necessary instances to run the tests; the other ones are queued behind it and will get access to the same instance, one at a time.

Bonus: Jenkins locks

Starting and stopping instances periodically would be dangerous if there weren't a mechanism for mutual exclusion between builds and lifecycle operations like starting and stopping. Not only do you not want to run builds for the same project on the same instance if they interfere with each other, but you definitely don't want an instance to be shut down while a build is still running.
Therefore, we wrap both these lifecycle operations and builds in per-resource locks, using the Jenkins Lockable Resources plugin. If the periodic stopping task tries to stop an instance where a build is running, it will have to wait to acquire the lock. This ensures that machines that see many builds do not get stopped easily, while idle ones will be stopped at the end of their already paid hour.

Conclusions

Cloud computing is meant to improve the efficiency with which we allocate resources from someone else's data centers: you pay for what you use. Therefore, with a little persistence provided by EBS volumes, you can pay only for the hours that your builds require, and not for keeping idle EC2 instances running every day of the year. Of course, you'll need some tweaking of your automation tools to easily start and stop instances; but it is a surefire investment that can usually save more than half of the cost of your testing infrastructure by putting it at rest during weekends and non-working hours.