Data Centre Migration: Trends and Adoption of Technologies
By Alex Hooper, Head of Operations, BMJ
During 2017, we built out substantial infrastructure for the development and re-launch of a major product that has moved from a monolith to microservices. We provisioned faster than ever before, and were able to destroy, recreate, resize, and generally tune infrastructure at the drop of a hat.
The size of our estate has nearly tripled over this time, but the size of the team has stayed constant.
We (my team of four Ops engineers) could never have achieved this had we not spent this period pursuing a strategy of adopting emergent cloud technologies, particularly around the automation of provisioning, orchestration, and configuration, coupled with the use of hybrid cloud.
Adoption of these technologies has, almost without exception, been an extremely positive process: easy to learn, well documented, exciting to use, and bringing tangible, measurable benefit to the business. The downsides tend to be those that affect any new technology: the closer you get to the cutting edge, the less stable the tools are, and we have, on occasion, had to re-work things when a tool's upgrade breaks backwards compatibility. This risk can generally be managed with common sense: pulling knowledge from the tech teams, from colleagues and from peers, and letting that guide strategy and adoption. The discovery process itself can foster a bond between dev and ops teams that is critical on this path.
This cultural change--the move towards the much-vaunted “DevOps” culture--is a welcome and necessary partner to adopting a highly automated cloud strategy, and reaps rewards in time efficiency, trust, and transparency.
Let’s focus on an example. Last year, we ran a project to migrate all our development and test environments from an in-house data centre run by our parent, the BMA, out to public cloud. Only a handful of years ago, this kind of thing would have been massively intrusive and disruptive to BAU; thanks to hybrid cloud, we were able to do it with virtually no impact on the developers or the rest of the business. With a VPN between our in-house DC and our externally hosted private cloud, and Direct Connect from there to our AWS environment, we were able to extend the logical network across the lot and then move individual components between DCs without breaking dependencies. Larger, more tentacled systems could remain in the original DC, servicing the applications we had moved out to AWS. Skills were honed on the simpler pieces, and the larger systems were tackled once confidence had built. There was regular cause for celebration as applications were migrated, rather than the sense of dread that hangs over an “everything at once” move.
This migration had a number of more concrete advantages, the most significant of which were cost savings and improvements in time-to-delivery for infrastructure requests. As our developers are UK-based, these dev/test environments were only required during UK working hours. In pretty short order, our Ops engineers were able to knock up some scripts that leveraged the AWS APIs to shut down and restart our EC2 and RDS instances on a schedule, cutting cost by about two thirds.
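To give a flavour of what such a script looks like, here is an illustrative sketch (not our actual code) using Python and boto3; the region, the tag name, and the database identifier are assumptions made purely for the example. A mirror-image start script, run from cron or a similar scheduler each morning, completes the picture.

```python
# Illustrative out-of-hours shutdown sketch using boto3.
# EC2 instances are selected by an assumed "Schedule=office-hours" tag.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
rds = boto3.client("rds", region_name="eu-west-1")

def stop_tagged_ec2_instances():
    # Find running instances that carry the (assumed) scheduling tag.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Schedule", "Values": ["office-hours"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [
        i["InstanceId"] for r in reservations for i in r["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)

def stop_rds_instance(identifier):
    # RDS instances are stopped one at a time by identifier.
    rds.stop_db_instance(DBInstanceIdentifier=identifier)

if __name__ == "__main__":
    stop_tagged_ec2_instances()
    stop_rds_instance("dev-database")  # hypothetical identifier
```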
Another immediate win was the reduction in the time it took to provision or resize infrastructure. Even though the in-house environments had been virtualised, it still took a ticket and several handoffs to get anything done: new infrastructure took days to provision, and even changing the amount of RAM or CPU could take just as long. Once in public cloud, resizing was done in minutes, with provisioning not far behind.
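Resizing is just as scriptable as the shutdowns above. As a rough illustration of what “done in minutes” means in practice, the sketch below stops an instance, changes its type, and starts it again via the EC2 API; the instance ID and the target type are hypothetical.

```python
# Illustrative sketch: resize an EC2 instance via the API.
# The instance must be stopped before its type can be changed.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

def resize_instance(instance_id, new_type):
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id, InstanceType={"Value": new_type}
    )
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

resize_instance("i-0123456789abcdef0", "m5.xlarge")  # hypothetical values
```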
Risks? Well, iterating your way to the future means running hybrid or parallel systems, which necessarily increases complexity during the transition period. I have generally found this risk to be manageable. Documentation is important, as is peer review or intra-team discussion of it. It needn’t be highly formal; it is more important that it is easily created and collaborated on, and I find a wiki works well. It also helps that the systems being adopted are software-based and API-driven, and thus largely self-documenting: they live in your source-code repository, are driven by your CI chain, and so on. And, of course, this puts Operations work into a landscape the developers know well, which helps open that knowledge up to them.
You do need good people: get people with nous and passion. CVs really say very little in this space, I find. These technologies are hot at the moment, their practitioners attract a premium rate, and it can be a challenge to find a good recruitment path; people with real-world experience in this field will be expensive. I have found, though, that you can play this to your advantage: create an environment where young talent can learn these technologies, expect a one- or two-year tenure, and use that turnover to continually validate that documentation and knowledge-sharing are working. You must avoid creating any human single points of failure in your teams.
It is also imperative that you build in instrumentation, monitoring, and a way to view the metrics you are pulling back. Prometheus, Grafana, Logstash, Kibana, CloudWatch: these and other tools are easily integrated into automation and are essential for managing and understanding your estate.
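By way of illustration, metrics can be emitted from the same kind of automation scripts described above. The sketch below (the metric name and port are invented for the example) uses the Python Prometheus client to expose a value for Prometheus to scrape and Grafana to chart.

```python
# Illustrative sketch: exposing a metric from an automation script so that
# Prometheus can scrape it and Grafana can chart it.
import time
from prometheus_client import Gauge, start_http_server

# Assumed metric name; anything descriptive works.
stopped_instances = Gauge(
    "ops_scheduled_stopped_instances",
    "Number of instances stopped by the out-of-hours scheduler",
)

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics
    while True:
        stopped_instances.set(42)  # placeholder value from real logic
        time.sleep(60)
```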
The future? Our next major initiative is the Infrastructure as Code (IaC) project. The end goal? That we can look at the orchestration code in GitHub and know that it represents the current state of our infrastructure. Though still in its infancy, this process already informs our development environment, with greenfield projects having their infrastructure and networking created via Terraform. There are massive advantages to be had here. Consistency is much easier to achieve, especially when the orchestration code is modularised, and this dramatically reduces the rate at which technical debt accumulates. Disaster recovery comes almost as a side-effect, as the code can, for example, be pointed at a separate AWS account or even (modulo a little refactoring) another cloud provider. Cost savings are greater because we are no longer limited to shutting down the components that allow it: we can destroy the entire infrastructure overnight and create it all again in the morning, or on demand. Making this destruction and recreation part of standard working practice brings another huge advantage: the correctness of the code is continuously tested and validated, so by the time the same code is used to create staging and live infrastructure, we will know it is robust and correct.
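As a sketch of what making destruction and recreation routine can look like, the hypothetical wrapper below simply drives the Terraform CLI from a scheduler such as cron; the directory layout and the "up"/"down" convention are assumptions made for the example.

```python
# Hypothetical wrapper that a scheduler (e.g. cron) could call to tear down
# a development environment at night and rebuild it in the morning.
import subprocess
import sys

TF_DIR = "environments/dev"  # assumed path to the Terraform configuration

def terraform(*args):
    # Run a Terraform command in the environment's directory, failing loudly.
    subprocess.run(["terraform", *args], cwd=TF_DIR, check=True)

if __name__ == "__main__":
    action = sys.argv[1]  # "up" in the morning, "down" at night
    terraform("init", "-input=false")
    if action == "down":
        terraform("destroy", "-auto-approve")
    else:
        terraform("apply", "-auto-approve")
```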
It’s not fully integrated with CI yet, but that’s only an iteration away and, actually, there is something positive about the fact that the developers have to log on to the system and issue a command or two. As previously noted, our cloud initiatives are naturally encouraging the development of a DevOps culture at BMJ, which is already bringing extra efficiency: devs and ops are becoming more like a single team, with the divide blurring.
One of the reasons we chose Terraform as our Infrastructure as Code tool is its non-partisan nature: we use AWS, but it should be a relatively simple task to switch to Google Cloud or another provider. Our roadmap currently has us looking next at containerisation, largely because this will give us even greater portability and thus expand choice, and at serverless, which we are already adopting at the periphery for event-driven services where load is relatively light.