DevOps Awards winner Improbable on “unleashing the full power of the cloud”
Lead Software Engineer
Improbable Worlds Limited — commonly known as Improbable — is a metaverse technology company that’s been at the forefront of building virtual worlds since 2012. With a world-class team, Improbable creates immersive gaming and event experiences using its Morpheus Technology, allowing over 15,000 users to interact as if they were in the same place at the same time. In this blog post, we’re highlighting Improbable for the DevOps achievements that earned the company the ‘Unleashing the Full Power of the Cloud’ award in the 2022 DevOps Awards. If you want to learn more about the winners and how they used DORA metrics and practices to grow their businesses, start here.
Video game builds traditionally force a choice between high infrastructure costs and longer wait times for developers and other downstream processes, but neither is tenable when you are trying to bring together tens of thousands of users in a single virtual environment. Rapid prototyping and QA are essential to the games industry and to companies building virtual worlds, so developers must get working builds and deployments out as soon as possible to validate, gather feedback, iterate, and try again. A single build failure can block the work and testing of hundreds of individuals, and waiting hours on a fix is not an option, so the systems we provide must be fast and reliable.
Beyond speed, scalability and stability were becoming major issues as Improbable’s original static and inflexible system had to adapt to rapid expansion, more aggressive deadlines, and intense growth in daily build requirements. Because the old infrastructure relied on tightly integrated systems, even upgrades and new features could cause a failure in a single, small system, which in turn could take down the company’s entire service.
Meeting customer needs
To meet these growing demands, our organization needed to address both technological and process-based challenges. Serving our customers required a purpose-built infrastructure for Windows Metaverse (Game) development that was fast, cheap, highly reliable, and highly scalable. That infrastructure also had to come with top-class support to keep developers unblocked.
The key to this project's success was adopting CI/CD as a service. This meant that we would provide guidance to developers on CI/CD development, as well as providing:
Automated merge tools
Automated release tools
A complicated problem like this needed a more technically elegant and complex solution to optimize build times rather than just throwing compute at the problem. In addition, Windows VMs can become difficult to manage without containerization. Costs can also spiral quickly, with diminishing returns on investment. Our teams found all of the technical solutions they needed in the cloud.
Using Google Cloud, we were able to develop a more stable, scalable, and sustainable approach to development that integrates a number of Google Cloud tools and services right from the beginning. When a job request comes in, Cloud Run scalers respond immediately to get the process going as quickly as possible. There are two of them: a webhook scaler for instant response, and a polling scaler as a backup in case of webhook or external service-related issues. Because the scalers run on Cloud Run, they also auto-scale to match demand and are highly reliable.
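To make the webhook-plus-polling idea concrete, here is a minimal Python sketch. It is not Improbable's scaler code: the class, function names, and job IDs are all hypothetical, and the "launch" step is only recorded in a list rather than calling any cloud API. The point it illustrates is that both paths funnel into one idempotent request, so a job picked up by the webhook is not launched a second time by the poller.

```python
from dataclasses import dataclass, field


@dataclass
class ScalerState:
    """Tracks which job IDs already have a build agent requested (hypothetical model)."""
    handled: set[str] = field(default_factory=set)
    launched: list[str] = field(default_factory=list)

    def request_agent(self, job_id: str, source: str) -> bool:
        """Request an agent for job_id once; return True only on the first request."""
        if job_id in self.handled:
            return False
        self.handled.add(job_id)
        # A real scaler would start a VM here; this sketch only records the decision.
        self.launched.append(f"{source}:{job_id}")
        return True


def on_webhook(state: ScalerState, job_id: str) -> bool:
    """Fast path: the CI system pushes a job event to the scaler."""
    return state.request_agent(job_id, "webhook")


def poll_queue(state: ScalerState, queued_job_ids: list[str]) -> list[str]:
    """Backup path: scan the queue and pick up anything the webhook missed."""
    return [j for j in queued_job_ids if state.request_agent(j, "poll")]


state = ScalerState()
on_webhook(state, "job-1")                      # webhook delivered normally
missed = poll_queue(state, ["job-1", "job-2"])  # the webhook for job-2 was lost
print(missed)          # ['job-2']
print(state.launched)  # ['webhook:job-1', 'poll:job-2']
```

In a real deployment the deduplication state would have to live somewhere shared, since Cloud Run instances are stateless; the in-memory set here is purely for illustration.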
Rather than simply building directly on a VM, we use Compute Engine’s Windows Server for Containers images and launch a secondary Windows instance as a container on the host VM. This lets us isolate source code, assets, and build output on Virtual Hard Drives (VHDs) running on the host VM. Changes made during a build, along with the build output, can either be cached or discarded at the end of the run, when we delete the container and reset the VHD back to a known state. This gives us absolute build isolation and reproducibility, and lets the build agent quickly return to the pool for its next job.
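The cache-or-reset behaviour can be pictured with a toy model. The Python sketch below is an assumption-laden illustration only: a plain dictionary stands in for the VHD, a context manager stands in for the container lifecycle, and the `keep_cache` flag is a hypothetical name. No real container or disk tooling is involved.

```python
import copy
from contextlib import contextmanager


@contextmanager
def build_container(vhd: dict, keep_cache: bool = False):
    """Run a build against the VHD, then keep or discard its changes on exit."""
    known_state = copy.deepcopy(vhd)   # snapshot of the known-good state
    try:
        yield vhd                      # the build mutates the VHD contents
    finally:
        if not keep_cache:
            vhd.clear()
            vhd.update(known_state)    # reset back to the known state


vhd = {"source": "rev-100", "artifacts": []}

with build_container(vhd) as disk:
    disk["artifacts"].append("game.exe")  # build output written during the run

print(vhd)  # {'source': 'rev-100', 'artifacts': []}
```

Because the reset happens in the `finally` clause, a failed build leaves the disk in the same known state as a successful one, which is the property that lets an agent go straight back into the pool.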
Our development process also introduces the use of “golden images” — Google Cloud images of a VM that has just run a full suite of all known possible build combinations for a specific project. For game design with Unreal, this includes builds for all platforms across debug, development, test, and shipping build configurations. All of the source, assets and build data will be cached to VHDs on the image. This cached data present on a golden image allows the next job to be iterative and thus significantly faster, while maintaining a known state to reset to.
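As an illustration of what "all known possible build combinations" means in practice, the following Python sketch enumerates a build matrix. The four configurations come from the text above; the platform list is a made-up placeholder, since the post does not name the platforms involved.

```python
from itertools import product

# Hypothetical platform list; the actual targets are not stated in this post.
PLATFORMS = ["Win64", "Linux"]
# Build configurations named in the text.
CONFIGURATIONS = ["Debug", "Development", "Test", "Shipping"]


def build_matrix(platforms: list[str], configurations: list[str]) -> list[str]:
    """Return one label per platform/configuration pair to warm on the image."""
    return [f"{p}-{c}" for p, c in product(platforms, configurations)]


matrix = build_matrix(PLATFORMS, CONFIGURATIONS)
print(len(matrix))  # 8
print(matrix[0])    # Win64-Debug
```

Every extra platform multiplies the number of builds that must run before an image is captured, which is why this work is done once per golden image rather than on every job.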
These and other technological tools and improvements have reduced many of our biggest pain points, but even with these technical solutions, Improbable’s digital transformation would only be half-complete without also fostering a DevOps-first engineering culture. Some of the most notable successes in this cultural transformation included:
Tracking metrics as soon as possible to identify and correct key time-wasting areas
Empowering smoother outage mitigation with reliable backup systems and clear workflows and guidelines on how best to deal with specific issues
Running health checks with system redundancy to ensure key systems are working as expected
Implementing data sharing across teams for metrics around build times, reliability, and costs to keep teams honest and encourage cooperative problem solving
Staying proactive in finding, reporting, and addressing problems rather than remaining passive and reactive
To help developers do their best, our organization began to prioritize developer time over infrastructure costs. By reducing variance in lead time, we reduce developer frustration and allow developers to plan their time around reliable average delivery times. We also introduced practices to reduce outages, including rapid-state reporting and reducing complexity in shared systems and codebases.
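Why variance matters as much as the average can be seen with a small Python sketch. The two lists of build durations below are invented sample data, not measurements from Improbable; they are chosen to share the same mean while differing widely in spread.

```python
from statistics import mean, pstdev

# Invented build durations in minutes, for illustration only.
steady_builds = [14, 15, 16, 15]
spiky_builds = [5, 30, 5, 20]

for name, durations in [("steady", steady_builds), ("spiky", spiky_builds)]:
    # Same average, very different predictability.
    print(f"{name}: mean={mean(durations):.1f} min, spread={pstdev(durations):.1f} min")
```

Both pipelines average 15 minutes, but only the steady one lets a developer rely on that figure when planning their day.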
The power of cloud
With the power of the cloud and focused DevOps practices, our organization has seen notable improvements in both cost savings and development efficiency. The number of build jobs performed daily went from 500 to 3000+, and source changes now go through eight preflight validations where before there were only two. Costs dropped dramatically, from $1.70 per job to $0.50, and projects that would have cost $900k per year in the old system can now be accomplished for $120k with the same build output.
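To put those figures side by side, the short Python sketch below works out the relative changes. It uses only the numbers quoted above; the helper name is arbitrary, and 3000 is used as a lower bound for the "3000+" daily jobs.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from before to after, as a percentage of before."""
    return (before - after) / before * 100


per_job_saving = percent_reduction(1.70, 0.50)       # cost per build job
annual_saving = percent_reduction(900_000, 120_000)  # example project cost per year
throughput_gain = 3000 / 500                         # daily build jobs, lower bound

print(f"Cost per job down {per_job_saving:.1f}%")   # about 70.6%
print(f"Annual cost down {annual_saving:.1f}%")     # about 86.7%
print(f"Daily jobs up {throughput_gain:.0f}x")      # 6x
```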
Through the adoption of all five capabilities of cloud computing, we saw improvements to our software delivery and organizational performance. This includes:
Resource pooling: By running build requests in parallel on hundreds of VMs — including thousands of vCPUs working at maximum capacity — we optimize processing power by spreading the resource workloads, meaning that the more projects and customers involved, the lower the price is for each build job.
Rapid elasticity: With a double scaler tech stack that keeps our VM pool at optimum capacity all the time, build requests are serviced within 10-160 seconds. The VM pool is dynamically resized to match demand — including adding more VMs as needed and killing idle ones.
Measured service: We can see whether we are on track with SLOs and feature budget spend because bots built with Cloud Run post easy-to-digest reports and updates on our performance, and we track everything from build times to pass/fail rates using DataDog tracing and metrics stacks.
On-demand self-service: Our developers can run experiments and gather test data without blockers or bureaucratic processes, using tools that automatically spin up a VM for their specific build request in an environment that is completely isolated from production.
Broad network access: By using Cloud Identity-Aware Proxy (IAP) for access control and resource management with a zero-trust security model, our developers can remotely use our cloud resources at will without the limitations of office IP or broad, catch-all firewalls.
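The rapid elasticity point above, resizing the VM pool to match demand, can be sketched as a simple sizing rule. This Python example is hypothetical throughout: the warm-spare count, the pool ceiling, and the function names are assumptions made for illustration, not values or logic taken from Improbable's scalers.

```python
def desired_pool_size(busy: int, queued: int, warm_spares: int = 2, max_vms: int = 300) -> int:
    """Target VM count: running jobs, waiting jobs, and a few spares, capped at max_vms."""
    return min(max_vms, busy + queued + warm_spares)


def resize_plan(current: int, busy: int, queued: int) -> int:
    """Positive result: VMs to add. Negative result: idle VMs to remove."""
    return desired_pool_size(busy, queued) - current


print(resize_plan(current=10, busy=10, queued=5))  # 7   (scale up)
print(resize_plan(current=40, busy=3, queued=0))   # -35 (remove idle VMs)
```

Keeping a small number of spares is one way to serve the next request without waiting for a VM to boot, at the price of paying for a little idle capacity.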
With these applications of the full power of the cloud, we’ve seen measurable improvements in the development process, including:
Deployment frequency: Build system rollouts went from weekly to after every merge, with projects, products, and customers now being able to deploy their metaverses hundreds of times per day.
Lead time for changes: Using this cloud stack cut lead times by more than 70%, with average primary CI build times dropping from 60 minutes to 15 (4 times faster) and metaverse deployment builds going from 90 minutes to 25 (3.6 times faster), enabling Improbable to move faster and rapidly iterate and test new features.
Change failure rate: The overall success rate for servicing build jobs went from 96% to 99.99%, and the success rate for master build jobs went from 80% to 99%, saving money and time, especially through fewer support calls.
Time to restore service: Between the dual scaler system, the container setup, and other cloud-based solutions, there have only been three occasions where we had to completely wipe the pool of VMs and restart the scalers due to technical issues, and these outages only lasted 10 minutes before all builds were back online and serviced.
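For readers who want to compare the lead time and failure rate figures in the list above on a common scale, here is a small Python sketch. It only restates numbers already quoted in this post; the helper names are arbitrary.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """How many times faster the new build is."""
    return before_minutes / after_minutes


def failure_rate(success_percent: float) -> float:
    """Convert a success percentage into a failure percentage."""
    return 100 - success_percent


ci_speedup = speedup(60, 15)      # primary CI builds
deploy_speedup = speedup(90, 25)  # metaverse deployment builds

# Servicing failures fell from 4% of jobs to 0.01% of jobs.
failure_ratio = failure_rate(96) / failure_rate(99.99)

print(f"CI builds: {ci_speedup:.1f}x faster")
print(f"Deployment builds: {deploy_speedup:.1f}x faster")
print(f"Servicing failures: roughly {failure_ratio:.0f}x fewer")
```

Expressing the reliability gain as a failure rate makes the change easier to appreciate: moving from 96% to 99.99% success looks like a few percentage points, but it means about 400 times fewer failed job requests.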