Friday, December 28, 2018

Delivery pipelines for CDNs

https://www.fastly.com/network-map
In the last couple of years I have integrated Content Delivery Networks into various eLife applications, managing objects ranging from static files and images to dynamic HTML. These projects mainly consisted of:
  • implementing Infrastructure as Code for these CDNs inside the Github repositories we already use for all other cloud resources (AWS and GCP)
  • effectively authorize HTTPS on the CDN side, which will be impersonating your origin servers
  • create instances of the same CDN services, first in testing and then in production environments, keeping them in parity with each other
  • expand end-to-end testing (the tip of the pyramid) to cover also the CDNs rather than just covering the applications involved
  • integrate logging in order to catch any problem happening between the user and the origin servers
  • finally phase in the new CDNs with new geotagged DNS entries
Our first implementation from 2016 was widely integrated into AWS and as such CloudFront was the chosen solution. We subsequently switched to Fastly for all ordinary traffic, experiencing a general increase in features, customization and expenses. What follows is a comparison that isn't just meant to orient the reader between CloudFront and Fastly, but also against the third option of not using a CDN at all. In fact, there are many concerns that may be glossed upon but that you need to take into account seriously when you move your web presence from a few origin servers to a global network of shared, locked down servers managed by an external organization.

Infrastructure as Code

Our AWS-based setup is making a large use of CloudFormation, the native service for declaratively specifying resources such as servers, load balancers and disks. The simple setup has been augmented over the years by a code generation layer for the CloudFormation templates; this Python code reduces duplication between the various templates by starting from standard EC2/ELB/EBS resources that can be customized in size and other parameters.
If we start from a simple single-server setup for a microservice (this was before Docker containers got stable enough), we are looking at a template containing at least an EC2 instance and a DNS entry pointing to it. With multiple servers, we expand this with a load balancer that pulls in a TLS certificate provided to IAM by an administrator.
To configure CloudFront via CloudFormation, an additional resource for the CDN distribution is introduced. All the configuration you need will be visible in this resource, a JSON dictionary or XML tag respecting a certain schema.
Since CloudFormation can only manage AWS resources and nothing outside that tended garden, Fastly was the reason for introducing Terraform alongside it. Whereas almost anything AWS-specific still goes through CloudFormation, Terraform has opened up new roads such as Infrastructure as Code implementations for Google Cloud Platform (storage buckets and BigQuery tables).
Applying changes in this context is not trivial as you may inadvertently reboot or destroy a server while believing you were only changing a minor setting. Yet Infrastructure as Code is about making the current state of infrastructure and all changes visible, easy to review and safe to rollout across multiple environments. It is imperative therefore to maintain testing environments created with the same tooling as production, and to use them to ultimately integration test all changes.
The caveat of using multiple tools in lockstep for the same instance of a project (including servers, cloud resources and CDNs) is that they can't declare dependencies between resources managed by different tools. For example, since we manage DNS in CloudFormation and Fastly CDNs in Terraform, we can both at the same time but can't couple together the existence of a DNS and the CDN it points to, or impose a creation or update order that is different from the general order we run the tools in.
The most glaring difference in updates rollout between the various options is that, to rollout a CDN configuration change, it takes:
  • no deployment time if you don't use a CDN (obviously)
  • 10s of seconds for Fastly
  • 10s of minutes (up to 1 hour was common) for CloudFront
This means Fastly opens up the possibility for experimentation, even if with a slower feedback that your local TDD cycle. With CloudFront this is painful and haphazard as you decide on a change, start applying it and come back one hour later to check its effects, after having already switched to another task.
Still, minutes of update and/or creation time make Fastly unavailable for inclusion in the CI environments where the tests of a single service are run. You could in theory create a Fastly service on the fly when the build of the service runs, but this will add minutes to your build _and_ also promote coupling to the CDN itself. Fast forward this a bit and you'll see an application unable to be run locally anymore for exploration because of the missing CDN layer. Therefore, like cloud services the CDN is treated like a long-lived resource, with its regression testing performed into a shared environment on every new application commit, but after merge.

Logging

Within a web service, you usually have some kind of access log being generated by nginx or Apache. These logs can sit on a single server or can be uploaded to some aggregation point, whether it is a local Logstash or an external platform that can index them.
Even load balancing doesn't change this picture very much as the load balancer(s) logs should be identical to the ones of the application servers if everything is working well. But with a CDN, large-scale caching is introduced and so it's plausible that you will stop directly seeing a large percentage of your traffic. Statistics or monitoring based on access logs may get skewed; or worse, Japan may be cut off from your website for a while because the health checks from the CDN points of presence there have a timeout of a few milliseconds too low to get to your servers in us-east-1 (of course this never happened).
Hence, to understand what's going on in those few hundred servers you have no access to, you need a way to stream them to some outsourced service; this can be storage as a service (S3 or GCS) or directly some log infrastructure provider. The latency with which logs can get in the right place is a key metric of the feedback loop from changes.
Since we are striving for Infrastructure as Code, all the logging configuration should be kept under version control together with hostnames and caching policies. We got to a standard logging format (JSON Lines with certain fields) and frequency, along with GCS bucket where to put new entries, bucket names following conventions. This was later expanded into BigQuery tables providing queries over the same data, after the Terraform Fastly provided started supporting this delivery mechanism.
The main difficulty in integration was credentials management: you aren't told much if credentials are not correct or not authorized to perform certain actions like writing to BigQuery. Moreover, you can't just commit a bunch of private keys for anyone to see, especially since Infrastructure as Code repositories tend to be made very visible to as many people as possible.
We ended up putting GCP credentials and similar secrets in Vault, running on the same server as the Salt master (same thing as Puppet master). The GCP Service Account itself and its permissions to write to the bucket needed some special permissions to set up (it's turtles all the way down) so couldn't put it directly into Infrastructure as Code but had an admin manually creating it instead. The ideal thing would be for Vault to generate credentials by itself, following the pattern of periodically rotating them. But then it would need to push these credentials somehow into the Fastly configuration and I'm here to provide efficient delivery pipelines, not make cloud giants wrestle.

Flexibility

Your own application is usually highly customizable, with a certain cost associated. You have to write some code in your favorite programming language, possibly following some framework conventions and calling your classes Middleware or EventListener.
CDNs work on shared servers, so they have limits on what can be safely run in that sandboxed environment. Nevertheless, Fastly provides the possibility to customize the VCL that runs each service with your own snippets and macros.
This is very flexible, perhaps even too much: you can introduce headers with random values, write conditionals and implement loops by restarting requests. It feels similar to working in nginx configurations but with a more predictable language.
The main problem with this form of customization is that there is no way to run it or test it on your own. The best feedback loop we found is the Fastly Fiddle (similar to JS Fiddle) where you test out bits of code, hit a save button and see it propagated to servers around the world for you to test.
The fact that this even exists is impressive, but you can imagine how well it works for actual development. Once you get past experimenting, you can't integrate a Fiddle with your own Infrastructure as Code approach (e.g. Terraform templates) nor easily port code from one to another besides copying and pasting. You can run integration-only tests in some other window, but the feedback loop can't be shorter than the deployment time; unit tests are not a thing. You can't even use your IDE as much as you may love it. In the end, Fastly's Varnish diverged from the open source one 4 major versions ago; hence, this VCL is a proprietary language and you'll feel the same as writing stored procedures in Oracle's PL/SQL.
I tend to see VCL and other intermediate declarative templates (such as Terraform .tf files) as a generation target for Infrastructure as Code to compile to. This lets you unit test that your tools generate a certain output for these templates; use dummy inputs in tests and check dummy expected outputs; all of this will still need to be integration tested with the application itself in a real environment, but some of the responsibilities can be developed in the tool itself and reused across many applications.

Integration testing

We have understood by now that to keep the ensemble of servers, code, cloud services and CDNs we need some automated integration testing in place that touches all the different pieces. We don't want many scenarios to be tested at this level because it's slow and brittle to do so, but we need a tracer bullet that goes through everything, if only to verify all configurations are correct.
In the general context of outsourcing of responsibilities to a service or a library, you still own it as a dependency of your application and still need to verify the emergent behavior of custom code and borrowed architecture.
Therefore, I always put at least a staging environment in place replicating production where automated tests can run. This doubles as the place where to try and roll out infrastructure updates that are risky (which are risky? If you have to ask, all of them; just roll out everything through staging).
As we have seen, creating too many different, ad-hoc environments to test pull requests doesn't scale; this will reach death by feature branch as all of your Jenkins nodes are waiting for yet one more RDS node or CloudFront distribution to be created.
A common example of a coupled, integration-related feature to test is the forwarding of Host and other headers; these go through so many layers: a couple of CDN servers, a load balancer, an nginx daemon and finally the application. Some headers don't just have to be forwarded, but have to be rewritten or renamed or added (X-Forwarded-For). All of this can in theory be specified for every single layer but testing the whole architecture probably makes for easier long-term maintenance.

Why?

In various projects you always have to ask yourself why you are doing something (especially complex things) and what value you want to get out of it. CDNs are one of the go-to solution for web performance, their killer feature being huge caches for slow-changing HTML and assets across the world so that even a casual Indian reader can load your homepage in one second. Moreover, if done right the load on your origin servers will also be greatly reduced with respect to not using caching layers.
On the other hand, you can see the complexity, observability and maintenance needs that every additional layer introduces. When asking whether a CDN should do something or your application should do something, it's the same decision as for a database or a cloud service: how can you effectively store and update its configuration in multiple environments? Do you want to oursource that responsibility? How will you know when something's wrong? Do you feel comfortable writing stored procedures in a language you can't run on your laptop? All of these are architectural questions to go through when evaluating various CDNs, or no CDN.

Thursday, December 06, 2018

Book review: The 5 Dysfunctions of a team

https://en.wikipedia.org/wiki/The_Five_Dysfunctions_of_a_Team
This is a spur-of-the-moment review of The 5 Dysfunctions of a Team, a business novel on team health that I've read today as part of the quarterly Professional Development Days I take as part of working at eLife.

As a follow-up to my role evolution into Software Engineer in Tools and Infrastructure, I am looking again more into the people skills side of my job (as opposed to purely technical skills). I have done this cyclically during my career, as the coder hat becomes too restrictive and you have to pick up other tools to achieve improvement. In particular, I am working on eLife's Continuous Delivery platform and it is crucial to work with multiple product-oriented teams to have them adopt your latest Jenkins pipelines and Github reports.

Dysfunctions

Patrick Lencioni's model of team dysfunctions (or of blessed behaviors if you flip all definitions) is a pyramid where each dysfunction prevents the next level from being reached. It would be a disservice to how well and quickly this is got across in the novel to just try to list them here, but if I had to summarize this in a long paragraph it would look like:
Building trust between team members allows constructive conflicts, which enable people to commit to action and hold each other accountable for what has been decided; all in the service of results. -- not really a quote
The dysfunctions are the flip side of these positive behaviors, for example lack of trust or fear of conflict. The definitions of some of these terms are more precise than what you find in many Agile and business coaching books; so don't dismiss trust as just a buzzword, for example.

Some context

The case being treated in the book, which comes from the author's management consulting firm, is that of a CEO turning around a team of executives. This makes for a somewhat more fascinating view of results as it's talking about a (fictional) company's IPO or eventual bankruptcy. Despite not being a clear parallel to a software development team, I do think this is applicable in every situation where professionals are paid to work daily together, with some caveats.
In fact, I suspect the level of commitment to the job that you see in the book would be typical of either high stakes roles (executives) or a generally healthy organization that has already removed common dysfunctions at the individual level. If in your organization:
  • people are primarily motivated by money
  • they look forward to 5 PM
  • they browse Facebook and Twitter for hours each day
then there are personal motives that have to be addressed before teams can start thinking about collective health.

Yes but, what can I do in practice?

After the narrative part, an addendum to the book contains a self-administered test to zoom in on which possible dysfunctions your team may exhibit at the moment. It continues with a series of exercises and practices that address these topics, with an estimation of their time commitments or how difficult they would be to run. I definitely look forward to anonymously try out the test with my technical team, out of curiosity for other views.

My conclusions

I think most of the dysfunctions are real patterns, that can only be exacerbated by the currently distorted market for software developers and CV-driven development. The last dysfunction, Inattention to results, is worth many books on its own on how to define those results as employees at all levels are known for optimizing around measurable goals to the detriment of, for example, long-term maintenance and quality.
So don't start a crusade armed by this little book, but definitely keep this model in your toolbox and share it with your team to see if you can all identify areas for collective improvement; it is painfully obvious to say you can't work on this alone!
The author is certainly right when writing that groups of people truly working together can accomplish what any assembly of single individuals could never dream of doing.

Sunday, September 16, 2018

Eris 0.11.0 is out

Eris 0.11.0 has been freshly released, and I'll be listing here various contributions that the project has received that are included in this new version and in the previous one, 0.10.0, which didn't have an associated blog post.

For a full list and links to the relevant pull requests and commits, see the ChangeLog.

0.10

  • The Eris\Facade class was introduced to allow usage outside of a PHPUnit context.
  • Official PHPUnit 7 support was introduced.
  • Fixed a corner case in suchThat()
There are some small backward compatibility breaks with respect to 0.9; they regard unused features (or at least I thought) including Generator::contains().

0.11

  • Official PHP 7.2 support
  • Annotations support for configuring behavior that is usually configured through methods: @eris-method, @eris-shrink, @eris-ratio, @eris-repeat, @eris-duration

Some acknowledgements

Most of this work comes from contributions, not from me. I'd like to say a word of thanks to the people that have taken the time to use Eris in some of their projects but also to feed back a fix, an extension, or a substantial improvement.

Sunday, January 21, 2018

Book review: Production-ready microservices

https://www.amazon.co.uk/Production-Ready-Microservices-Standardized-Engineering-Organization/dp/1491965975
Production-Ready Microservices is a short book about consistently practicing architecture and design over a fleet of microservices.
In general, I think the principles described here apply very much to any service-oriented initiative, even more so if the services are coarse grained and hence require more maintenance than finely isolated ones.

Uber

The book extrapolates from the author's experience at Uber "standardizing over a thousand microservices". Given a few developers for each microservice team, that makes up 2000-3000 engineers from the total >10000 Uber employees (I wonder how many are lawyers). After WhatsApp's famous story of being acquired at 55 employees in total, that really highlights the difficulty level of running a business and operations all over the physical world (sending cars and drivers around in dozens of countries) with respect to a digital-only enterprise. We should remember this and many other directions of change the next time we hear a technology advocate saying how much the cost of his 2-people startup has been reduced by $technology.

The main message

You should be this tall to use microservices; this architecture doesn't necessarily fit every context; although integrating separate services of some size is becoming a standard after the API revolution (before that, it was integrate through the database which is arguable worse).
You will encounter many different social and technical problems, such as:
  • Inverse Conway's Law, with the shape of the products defining the shape of the company. Although I found out this doesn't really apply at smaller scales as development teams can own more than one service and experience a successful decoupling between people and code.
  • Technical sprawl, where multiple languages, databases and other key choices spread without a consistent, central planning.
  • More ways to fail: distributed and concurrent systems are more difficult to work with and to reason upon, plus the fact that there are more servers, containers or applications will simply multiply the failures you'll see.
There are lots of non-functional requirements like scaling each microservice and isolating it from the rest of the fleet; perhaps don't go too micro- if you don't have the resources to ensure an acceptable level in each service. Perhaps in your context the acceptable SLA for some particular service is low, because it's not change often or is only internally facing or is only used several times per day.
One particular aspect of the Consistency is important lesson is that the whole lifecycle of services should be considered. Maintenance and even decomissioning are as important as producing new MVPs: but I've seen many times services being neglected, or being considered very easy to migrate away from one some new shiny substitute was available. In reality, it takes time and effort to keep services up and running, and to finally kill them when you have an alternative, as data and users are slowly migrated off from the old to the new platform.
Lots of requirements are also overlooked but often turn out to be important as you increase your population of services: the scalability of a single endpoint, fault tolerance, even documentation (ADR are the only form I trust very much right now in a fast-moving organization context.) Every single section of this book will make you think about it, but won't give much of an overview: you're better served by reading the SRE book for example.

Value for money

This book is a short read which gives you an overview of what microservices challenges you're likely to face down that rabbit hole; in particular, it focuses on a medium-to-large organization context. I'm not sure this book is worth the price tag however: 20 pounds for a Kindle edition of ~170 pages, where ~25 pages are glossary, index and lots of checklists.

Wednesday, January 03, 2018

Book review: Algorithms to live by

https://www.amazon.com/Algorithms-Live-Computer-Science-Decisions/dp/1627790365 Algorithms to Live By: The Computer Science of Human Decisions is a book that puts together the domains of computer science and real life. The ensemble of topics being touched is wide. The book treats deterministic algorithms such as optimal sorting, but then moves on to more context-dependent strategies for caching and scheduling. The last chapter even get to model identification, (tractable and intractable-made-tractable) optimization problems, stochastic algorithms and game theory.

All the while, computer science concepts are compared to conscious and unconscious human processes. For example, caching and the memory hierarchy have great parallels with how the human brain recollects memories of recent events, and how we can augment our brain with external, slower supports like paper. Scheduling is useful not only to allocate processes on CPU cores, but also to make an explicit choice of strategy when prioritizing the tasks that you or your team face. Up to the more extreme examples of game theory and mechanism design, when the incentive system becomes more important than the individual agents (manage the system, not the people rings a bell?)

If you like viewing the world throughout the lens of algorithms and see how the strategies of humans and computer compare with each other, I would strongly recommend this book as it will make for an entertaining read and some principles to take away for real life usage (I hope sorting socks will be easier now). Skip it if you have a very wide knowledge of computer science, operations research, Nash equilibria... but even if I was familiar with the technical part, I was missing the connection to different domains or everyday, real world problems. I listened to the audiobook version, which lasts about 12 hours. You may find it easier to skim through some chapters if you are more (or less) interested in some topics. The problem with audiobooks is that I can't easily take notes, while highlighting on an e-book reader is quick and lets me recollect all important gotchas later into a text file.

New role: Software Engineer in Tools and Infrastructure

After working on eLife's testing and deployment infrastructure in 2016, in the last year my responsibilities in the technical team have shifted towards the domain of engineering productivity. Testing is one phase of the development process that is often a bottleneck, but there are many more areas like code reviews, monitoring and infrastructure itself (being it servers or services):
In summary, the work done by the SETs naturally progressed from supporting only product testing efforts to include supporting product development efforts as well. Their role now encompassed a much broader Engineering Productivity agenda. -- Ari Shamash on the Google Testing Blog
Moreover, the team starts from a high level of coverage and design on many projects, to the point that my focus has always been on the provisioning and automation of testing environments, and on large-scale end2end testing.

What seems just a letter on a job title (from SET to SETI) is in fact an alignment of responsibilities so that I am not accidentally mistaken for "the QA guy" but always seen as a problem solver instead.
https://en.wikipedia.org/wiki/Pulp_Fiction
Solving problems and propagating the solution, so that you don't have to solve them over and over again
Roles are always an approximation in a team of generalizing specialist that also distributes and collaborate on some roles such as that of architecture. But it's helpful in a cross-functional team to have someone dedicated to the task of productivity, whether it is reached through automation, tooling, or continuous improvement.

ShareThis