Kubernetes and the challenges of continuous software delivery

This article introduces high-level concepts about managing software deployments. It also describes the challenges of designing software delivery systems, and it outlines potential answers to those challenges. When relevant, this article focuses on Kubernetes as a platform for software deployment.

Many organizations are now looking to implement continuous delivery, whose goal is to release software in an efficient, quick, and sustainable way. If you're a platform architect, a DevOps engineer, or a release engineer in charge of designing and building the software delivery processes in your organization, this article is intended for you.

As you read this article, keep in mind that you don't have to tackle every challenge at once. You can incrementally improve your existing systems. You can start with changes that have the most impact, which are often not the most difficult to implement. When you decide what you want to modify, remember to optimize for two things: feature velocity and production stability. These goals might seem opposed, but as the State of DevOps reports show, they can be aligned.

How you approach software delivery in general, and continuous delivery in particular, can vary depending on your computing platform. Kubernetes is now a widely used platform on which to run applications. One of its main advantages is that it uses a declarative model. You describe the state that you want your applications to be in, and Kubernetes automatically attempts to reach that state. This declarative model can be contrasted with imperative deployment methods (such as those used by Ansible or Chef) in which you describe each step needed to reach the state that you want.
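To make the contrast concrete, the following Python sketch mimics a declarative reconciliation step. It is an illustrative toy under stated assumptions, not how Kubernetes is implemented: the `reconcile` function and the application names are hypothetical. You only state the replica counts you want, and the function works out which actions are needed.

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Return the actions needed to move `actual` toward `desired`.

    Both arguments map an application name to a replica count.
    """
    actions = []
    for app, want in sorted(desired.items()):
        have = actual.get(app, 0)
        if have < want:
            actions.append(f"start {want - have} instance(s) of {app}")
        elif have > want:
            actions.append(f"stop {have - want} instance(s) of {app}")
    # Anything running that is no longer declared gets removed.
    for app in sorted(set(actual) - set(desired)):
        actions.append(f"stop {actual[app]} instance(s) of {app}")
    return actions


desired_state = {"web": 3, "worker": 2}
actual_state = {"web": 1, "worker": 2, "legacy": 1}

for action in reconcile(desired_state, actual_state):
    print(action)
```

In an imperative tool, you would write the "start" and "stop" steps yourself. In the declarative model, they are derived from the difference between the two states.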

Choosing Kubernetes as your platform has a major impact on how you deploy applications. It affects the tools and methods you use, and it typically requires less logic to maintain than deployments to virtual machines or bare-metal servers.

Key concepts

This section defines terms and illustrates the relationships between concepts used in this article.

Diagram that shows the relationship of key concepts in software deployment.

Deployment artifact

A deployment artifact, or artifact, is a packaged application or module that is ready to be deployed and installed. An artifact is immutable: it cannot be modified after it's created. Many different formats for artifacts exist, for example, JAR or WAR files for Java applications, gems for Ruby applications, or Debian or RPM packages. In this article, the assumed format for an artifact is a container image. Artifacts are stored in an artifact storage system. The exact system used depends on the artifact format. For container images, you use container registries, such as Container Registry.

Deployment

In the context of software delivery, the word deployment is unfortunately ambiguous. It can mean three things:

  • The action of installing and configuring an artifact in a specific environment. For example: "My deployment to production failed."
  • The end result of the deployment action. For example: "The staging and production deployments are different."
  • A specific Kubernetes object that helps run multiple copies of a container. To distinguish this meaning from the other two, we use a capital D when referring to this Deployment.

By default, we use the word deployment (with a lowercase d) in the "action" sense.

Release

A release is a specific artifact you deem stable enough to use for production. To create a release, you usually create several artifacts, each new one fixing issues from, or adding features to, the previous one until you reach the stability or feature capabilities you want. Those artifacts can have names such as alpha, beta, or release candidate. The concept of release is widely used by software vendors or for software running on end-user devices (mobile apps, desktop software).

Used as a verb, release means to expose an artifact to production usage. While many teams don't distinguish between deploying to production and releasing, these terms are different. It's possible to deploy a feature to production and release it only later, for example by turning on a feature flag or by completing a blue/green procedure.
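The difference between deploying and releasing can be sketched with a feature flag. This is a minimal Python illustration under assumptions: the `FEATURE_FLAGS` dictionary and the `checkout` function are hypothetical, and a real system would typically read flags from a configuration service instead of an in-memory dictionary.

```python
# Hypothetical in-memory flag store. The new code path ships with the
# artifact, but stays dormant until the flag is turned on.
FEATURE_FLAGS = {"new_checkout": False}


def checkout(cart_total: float) -> str:
    """Serve the new code path only when the flag is on."""
    if FEATURE_FLAGS["new_checkout"]:
        return f"new checkout flow: {cart_total:.2f}"
    return f"legacy checkout flow: {cart_total:.2f}"


# The new code is deployed, but not yet released.
print(checkout(19.99))

# Releasing is a flag flip, not a new deployment.
FEATURE_FLAGS["new_checkout"] = True
print(checkout(19.99))
```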

A release channel is one way to offer flexibility to your users. With channels, for example, you can create stable releases for users who prioritize reliability and rapid releases for users who want new features as soon as possible.

Pipeline

A pipeline is a computing pattern that takes something as input, runs a series of processing jobs, and returns an output. You can serialize or parallelize these jobs, and you can represent them by a directed acyclic graph. This article discusses pipelines used in software delivery.
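As a sketch of the directed acyclic graph representation, the following Python example uses the standard library's `graphlib.TopologicalSorter` to group jobs into stages. The job names and their dependencies are hypothetical. Jobs in the same stage have no dependency on each other, so they could run in parallel.

```python
from graphlib import TopologicalSorter

# Each key is a job; its value is the set of jobs that must finish first.
pipeline = {
    "checkout": set(),
    "lint": {"checkout"},
    "test": {"checkout"},
    "package": {"lint", "test"},
}

sorter = TopologicalSorter(pipeline)
sorter.prepare()  # also raises CycleError if the graph is not acyclic

stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # jobs whose dependencies are done
    stages.append(ready)
    sorter.done(*ready)

for number, jobs in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(jobs)}")
```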

Continuous integration

Continuous integration (CI) is a methodology in which developers test and build (integrate) their code changes as often as possible. The goal of CI is to tighten the development feedback loop and to surface errors and problems as early as possible in the development process. CI operates on the rationale that the later an error is discovered, the more expensive it is to fix. Discovering a defect in production is, of course, something most developers want to avoid.

CI is usually implemented with pipelines that run periodically or are triggered when a developer pushes a code change. For example, a CI pipeline could have three jobs: one that runs a linter, one that runs unit tests, and one that builds an artifact from the source code.

The usual output of a CI pipeline is an artifact in an artifact storage system.
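The three-job CI pipeline described above can be sketched as a fail-fast runner. This is an illustration under assumptions, not a real CI system: the `run_pipeline` function is hypothetical, and the three jobs are stand-ins that return a fixed result instead of invoking a real linter, test suite, or build tool.

```python
from typing import Callable


def run_pipeline(jobs: list[tuple[str, Callable[[], bool]]]) -> dict[str, str]:
    """Run jobs in order and skip everything after the first failure."""
    results = {}
    failed = False
    for name, job in jobs:
        if failed:
            results[name] = "skipped"
            continue
        ok = job()
        results[name] = "passed" if ok else "failed"
        failed = not ok
    return results


# Stand-in jobs with fixed outcomes, for illustration only.
def lint() -> bool:
    return True


def unit_tests() -> bool:
    return False  # simulate a failing test suite


def build_artifact() -> bool:
    return True


print(run_pipeline([("lint", lint), ("unit-tests", unit_tests), ("build", build_artifact)]))
```

Because the unit tests fail in this sketch, no artifact is built, which is the short feedback loop that CI aims for.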

Continuous delivery

Continuous delivery (CD) is the capability of releasing code at any time. CD assumes that your code has passed the CI pipeline and any tests that you deem necessary (such as smoke testing, QA, and load testing).

It's important not to overlook the significant differences between CI and CD. Many organizations create a single pipeline that implements both CI and CD. We do not recommend this setup because CI and CD have the following conflicting goals:

  • The goal of CI is to provide a short feedback loop to developers. A CI pipeline should run in less than 10 minutes.
  • The goal of CD is to ensure the stability of your production environment. This goal might necessitate processes that can last several hours (such as canary analysis).

When you're actively developing an application, with several commits to the main branch each day, it's not ideal to trigger your full CD pipeline after your CI pipeline succeeds if your CD pipeline requires several hours.

Continuous deployment

Continuous delivery and continuous deployment are often used interchangeably, but in this article we make a distinction. Continuous deployment takes continuous delivery a step further by automatically releasing the application once it passes the required tests.

If you implement continuous deployment (and not only continuous delivery), the concept of release as a specific artifact loses its meaning because you potentially push every artifact to production.

In this article, we use the abbreviation CD for continuous delivery, not continuous deployment.

Environment

An environment is the infrastructure or set of computing, networking, and storage resources on which you deploy your application. When you work in the cloud, instances of managed services (a Pub/Sub topic or a Cloud Spanner instance, for example) are part of the environment. On Google Cloud, the best practice is to dedicate a project for each environment.

A set of configurations is usually associated with the environment for the applications it's hosting. These configurations usually vary from one environment to another. For more information, see the Configuration section.

Environments are usually isolated from each other: they don't share resources. An application is usually deployed to several environments before it reaches the production environment. Multiple applications can share the same environment. For more information, see the Environments section.

Configuration

A configuration is a piece of information your application needs in order to run that isn't part of the deployment artifact. You can change your configuration without creating or deploying a new artifact. A configuration isn't specific to a single deployment artifact. Different configurations can have different properties. The concept of a configuration is especially related to subjects discussed in this article, particularly GitOps. For example:

  • A configuration can have a value that varies by environment and is used by multiple applications in an environment. For example, the database host to use in environment A is db-a.example.com, but in environment B it's db-b.example.com.
  • A configuration can have a value that varies by environment and is used by a single application. For example, the database user to use in environment A is alice, but in environment B it's bob.
  • A configuration can be used by multiple applications that are themselves managed by different teams.

Typically, a configuration's name stays the same across environments, while a configuration's value varies across environments. However, you might choose to create a configuration whose value doesn't vary across environments for the following reasons:

  • You want to keep the possibility of making the value change in a specific environment in the future.
  • You want to be able to deploy and roll back an artifact without changing the value of the configuration.
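The idea that names stay the same while values vary can be sketched as a set of defaults plus per-environment overrides. The following Python example is a minimal illustration: the `resolve_config` function and the `log_level` and default values are assumptions, while the database hosts and users reuse the examples given earlier in this section.

```python
# Values shared by every environment unless overridden.
DEFAULTS = {"db_host": "db.example.com", "db_user": "app", "log_level": "info"}

# Environment-specific overrides. The names (keys) never change.
OVERRIDES = {
    "a": {"db_host": "db-a.example.com", "db_user": "alice"},
    "b": {"db_host": "db-b.example.com", "db_user": "bob"},
}


def resolve_config(environment: str) -> dict[str, str]:
    """Merge the defaults with the overrides of one environment."""
    return {**DEFAULTS, **OVERRIDES.get(environment, {})}


for env in ("a", "b"):
    print(env, resolve_config(env))
```

Here `log_level` has the same value everywhere, but because it is still a configuration, you can later override it in a single environment without building a new artifact.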

Secret

Secrets are a specific kind of configuration. Examples of secrets include database passwords, API keys for third-party services, and cryptographic keys. You should protect secrets to help prevent attackers from accessing what are otherwise private systems.

Deployment model

An abstract model is useful in discussions about software delivery and changes to a production system. This section introduces such a conceptual model. Changes you can make to a production system can be organized into five broad categories:

  • Deploying a new application version
  • Changing your configuration without changing the deployed artifact
  • Changing a secret that your application uses
  • Changing persistent data
  • Changing the infrastructure

You must treat each of these changes differently. For each category, we provide an overview of related Kubernetes concepts.

Deploying a new application version

This section discusses what happens when you deploy a new artifact in an environment. Artifacts are immutable across environments, which means that if you want to deploy the same version of your application in two environments (and possibly at different times), then you should use the same artifact in both cases.

Deploying an artifact in an environment requires the infrastructure to download that new artifact and (depending on the technology used) to reload, restart, or recreate processes. If you're using containers, you need to replace the containers based on the previous version's image with ones based on the new image.

When you are running multiple instances of your application simultaneously, you can use many different deployment strategies to update them. These strategies differ in how many instances you replace at once, how fast you replace them, how many additional instances you create during the process, and how you load-balance the traffic during the process.
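As a rough illustration of one of these dimensions (how many instances you replace at once), the following Python sketch simulates replacing instances in batches. The `rolling_update` function is hypothetical, and it deliberately ignores health checks, additional (surge) instances, and load balancing, which real strategies must handle.

```python
def rolling_update(instances: list[str], new_version: str, batch_size: int) -> list[list[str]]:
    """Replace instances batch by batch and return the fleet after each step."""
    fleet = list(instances)
    history = []
    for start in range(0, len(fleet), batch_size):
        for index in range(start, min(start + batch_size, len(fleet))):
            fleet[index] = new_version
        history.append(list(fleet))
    return history


for step in rolling_update(["v1"] * 4, "v2", batch_size=2):
    print(step)
```

A batch size equal to the number of instances replaces everything in one step, while a batch size of one gives the most gradual update.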

How you roll back or revert such changes depends on the technology and deployment strategy you use. In many cases (especially with containers), you implement a roll back as a roll forward, where you deploy the old image as if it were a new one.
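The "roll back as a roll forward" idea can be sketched as follows. This Python example is an illustration under assumptions: the `DeploymentHistory` class and the image names are hypothetical. The point is that a rollback doesn't rewrite history. It appends the old image as a new revision.

```python
class DeploymentHistory:
    """Append-only list of the images deployed to one environment."""

    def __init__(self) -> None:
        self.revisions: list[str] = []

    def deploy(self, image: str) -> int:
        """Record a deployment and return its 1-based revision number."""
        self.revisions.append(image)
        return len(self.revisions)

    def roll_back(self, to_revision: int) -> int:
        """Redeploy an earlier image as a brand-new revision."""
        return self.deploy(self.revisions[to_revision - 1])

    @property
    def current(self) -> str:
        return self.revisions[-1]


history = DeploymentHistory()
history.deploy("registry.example.com/app:1.0")
history.deploy("registry.example.com/app:1.1")
history.roll_back(to_revision=1)
print(history.revisions)
```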

You can use four main Kubernetes objects to deploy an application on Kubernetes:

  • The ReplicaSet object lets you specify the number of Pods (instances) of your application. This API object takes care of recreating any crashed instance.
  • The Deployment object helps automate ReplicaSets, for example, by gradually replacing an old ReplicaSet with a new one.
  • The StatefulSet object lets you create a set of Pods of your application with stable identities and storage. This API object is especially useful for distributed databases and other distributed systems that can't rely on Kubernetes' service discovery.
  • The DaemonSet object runs a single Pod of your application on every node of the Kubernetes cluster. You typically use a DaemonSet to provide additional features on top of Kubernetes, such as log shipping or monitoring.

Deployments are the most commonly used objects for deploying applications on Kubernetes. Deployments are useful when you want to simplify your CD pipelines. Kubernetes takes care of the deployment (hence the name Deployment). By design, the Deployment object only offers simple deployment strategies. If you want to implement more complex deployment patterns, we recommend that you build automation in your CD pipeline by using ReplicaSets. ReplicaSets have a simpler behavior than Deployments.
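For reference, here is a minimal sketch of a Deployment that uses the built-in rolling update strategy, written as a Python dictionary and printed as JSON (which `kubectl apply -f` accepts in addition to YAML). The application name, label, image, and the specific `maxSurge` and `maxUnavailable` values are assumptions chosen for illustration.

```python
import json

APP_LABELS = {"app": "web"}  # hypothetical label

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,
        # The selector must match the labels of the Pod template.
        "selector": {"matchLabels": APP_LABELS},
        "strategy": {
            "type": "RollingUpdate",
            # Create at most one extra Pod, and never go below 3 ready Pods.
            "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
        },
        "template": {
            "metadata": {"labels": APP_LABELS},
            "spec": {
                "containers": [
                    {"name": "web", "image": "registry.example.com/web:1.0"}
                ]
            },
        },
    },
}

print(json.dumps(deployment, indent=2))
```

Changing the `image` field and reapplying the object is enough to trigger a rollout. The Deployment creates a new ReplicaSet and scales the old one down.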

Changing your configuration without changing the deployed artifact

In theory, we would like to treat all changes the same way and use the same processes to roll them out. Unfortunately, this isn't possible in practice; artifacts and configurations have different lifecycles and constraints. For instance, the same artifact is used across all environments, ensuring that what passes tests in one environment is the same thing deployed to production. However, this dynamic isn't true of configurations. Configurations often vary from one environment to another and thus make configuration changes hard to test.

Because configurations usually don't change at the same time and in the same way as artifacts, you can't always reuse the CI/CD process designed for artifacts. However, you can still automate configuration changes or reuse parts of that CI/CD process. Automation is important because it reduces the risk of human error and enables quicker disaster recovery scenarios.

Another approach to the problem of configuration change is to separate a configuration that stays the same across environments from a configuration that varies by environment. For a configuration that is stable across environments, you can use the same CD pipeline that you use for artifacts. Sharing a pipeline lets you comprehensively test your configuration changes across multiple environments. To deploy an environment-dependent configuration change, you should use the part of the CD pipeline that is specific to that environment. This approach lets you reuse tested automation and ensures that your configuration change undergoes at least some testing (any post-deployment tests that you have built into your CD pipeline, or any tests that happen during the deployment itself).

For the reasons discussed earlier in this section, testing environment-dependent configuration changes is notoriously hard.

Several approaches to configuration testing are available. For example, you can implement a CI-like testing methodology as described in Configuration Testing: Testing Configuration Values Together with Code Logic, and use canaries to test your change on a small portion of the application instances. However, even advanced methods such as these cannot help you test something like a database host change that's synchronized across all application instances at once. In theory, you can test any configuration change, provided you have developed your application to test that specific change (including changing a database host). However, developing your application to test all possible configuration changes isn't doable in practice because of the work required. You have to analyze every configuration change to determine if you're willing to take the risk of deploying it without dedicated testing, or determine if you prefer investing significant time in developing change-specific tests.

Because testing environment-dependent configuration changes is inherently difficult, we recommend that you keep your configuration environment-independent when possible. This approach lets you test configuration changes across multiple environments. In other words, you should try to keep your various environments as similar to each other as possible.

We've shown how environments can affect configuration changes, but this isn't the only dimension to consider. As defined earlier, a configuration can be used by a single application, multiple applications, or even multiple applications managed by multiple teams. Changing a configuration used by multiple applications and teams is significantly harder than changing one used by a single application. Changing a shared configuration requires coordination across teams, coordinated testing, and coordinated deployment. Automation can help simplify that process, but implementing such a change remains a challenging task.

Organizations often underinvest in testing configuration changes because of the inherent challenges. As a result, configuration changes become a chief cause of production outages.

In Kubernetes, a configuration is stored in ConfigMaps. A ConfigMap is a simple key/value list. You can use ConfigMaps in two ways:

  • You can inject some of its content as environment variables in a Kubernetes Pod.
  • You can inject some of its content as a file mounted on the Pod's file system.
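Both injection methods can be sketched in a single Pod definition. The following Python example builds the manifest as a dictionary and prints it as JSON. The ConfigMap name, key, image, and mount path are hypothetical.

```python
import json

CONFIG_MAP_NAME = "app-config"  # hypothetical ConfigMap name

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {
        "containers": [
            {
                "name": "web",
                "image": "registry.example.com/web:1.0",
                # Method 1: expose one key as an environment variable.
                "env": [
                    {
                        "name": "DB_HOST",
                        "valueFrom": {
                            "configMapKeyRef": {
                                "name": CONFIG_MAP_NAME,
                                "key": "db_host",
                            }
                        },
                    }
                ],
                # Method 2: mount the ConfigMap so each key becomes a file.
                "volumeMounts": [{"name": "config", "mountPath": "/etc/app"}],
            }
        ],
        "volumes": [{"name": "config", "configMap": {"name": CONFIG_MAP_NAME}}],
    },
}

print(json.dumps(pod, indent=2))
```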

Kubernetes alleviates the problems discussed earlier in this section by providing powerful automation primitives. A good practice (implemented by Spinnaker, for example) is to consider your ConfigMaps immutable. When you want to change a configuration, you create a new ConfigMap, and then you modify your Deployment to use this new ConfigMap. The Deployment recreates the Pods one by one, slowly rolling out the configuration change. If the new configuration breaks your application to the point where the Pods don't start or crash quickly, then the Deployment stops the rollout.
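One way to treat ConfigMaps as immutable is to derive each ConfigMap's name from its content, so that any change produces a new object. The following Python sketch shows that idea. It is an assumption-laden illustration, not a description of how Spinnaker does it: the `versioned_config_map` function and the naming scheme are hypothetical. The `immutable` field is a standard ConfigMap field that makes Kubernetes reject later edits to the data.

```python
import hashlib
import json


def versioned_config_map(base_name: str, data: dict[str, str]) -> dict:
    """Build a ConfigMap whose name includes a hash of its content."""
    canonical = json.dumps(data, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()[:10]
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": f"{base_name}-{digest}"},
        "immutable": True,
        "data": data,
    }


old = versioned_config_map("app-config", {"db_host": "db-a.example.com"})
new = versioned_config_map("app-config", {"db_host": "db-a2.example.com"})

# Different content yields a different name, so the Deployment must be
# updated to reference the new ConfigMap, which triggers a rollout.
print(old["metadata"]["name"])
print(new["metadata"]["name"])
```

Because the old ConfigMap still exists, rolling back means pointing the Deployment at the previous name again.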

Changing a secret that your application uses

Your application likely needs secrets to work. Because secrets are sensitive by nature, you can't manage them as you would a normal configuration.

For example, it's quite common to store a configuration in a Git repository. You shouldn't use this practice for secrets without taking precautions. For instance, you need to encrypt the secrets in your repository. Several tools exist for that purpose. Or you can use a centralized secret management solution, as many organizations do. In this approach, you store your secrets in a single database, and various systems retrieve those secrets based on their credentials. Secret management is a large and controversial topic, and a full treatment is outside the scope of this article. The discussion here is limited to how to manage secret changes.

The data source for secrets might differ from the one used for configurations, but apart from this difference, deploying a secret change is very similar to deploying a configuration change. The principles and problems described in the Changing your configuration without changing the deployed artifact section are the same for secrets. As always, automation helps with the challenges of secret management, especially considering that a best practice is secret rotation (periodically changing the value of the secret to prevent its reuse if it becomes compromised). Secret rotation isn't feasible in practice unless secret changes are automated.
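Automated rotation can be sketched as a periodic check of each secret's age. The following Python example is a minimal illustration under assumptions: the 30-day period, the `needs_rotation` and `rotate` functions, and the dictionary shape are hypothetical, and a real implementation would also have to update the system that the secret protects (for example, the database password itself).

```python
import secrets
from datetime import datetime, timedelta, timezone

ROTATION_PERIOD = timedelta(days=30)  # assumed policy, for illustration


def needs_rotation(created_at: datetime, now: datetime) -> bool:
    """Return True when the secret is at least one rotation period old."""
    return now - created_at >= ROTATION_PERIOD


def rotate(secret: dict, now: datetime) -> dict:
    """Return the same secret if it is fresh, or a new random one otherwise."""
    if not needs_rotation(secret["created_at"], now):
        return secret
    return {"value": secrets.token_urlsafe(32), "created_at": now}


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
api_key = {"value": "initial-value", "created_at": created}

rotated = rotate(api_key, now=created + timedelta(days=45))
print(rotated["created_at"].isoformat())
```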

The previous section recommends that you keep configurations as stable as possible across environments. This guidance isn't true for secrets. For security reasons, secrets should be different in every environment. For example, the secrets for your development environment probably differ from the secrets for your production environment. Maintaining different secrets helps prevent an attacker who has secrets for one environment from immediately jumping to another environment. Fortunately, applications tend to not work at all when there is a problem with a secret, which makes any problem easy and quick to detect.

A Kubernetes Secret is similar to a ConfigMap in its function and usage. The difference is that data stored in a Secret isn't displayed by default (it's base64-encoded, which obscures the value but doesn't encrypt it), and you can use Application-layer Secrets Encryption to encrypt its content in the Kubernetes database. The Workload Identity feature of Google Kubernetes Engine (GKE) makes integration with some secret managers such as Secret Manager, HashiCorp Vault, or Berglas much easier. With this feature, each Deployment can have its own dedicated Google Cloud service account and use it to authenticate against those secret managers (without having to deal with keys).
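The following Python sketch shows what base64 encoding in a Secret means in practice: anyone who can read the object can decode the value, which is why encryption at rest and access control still matter. The `secret_manifest` function, the Secret name, and the password are hypothetical.

```python
import base64


def secret_manifest(name: str, values: dict[str, str]) -> dict:
    """Build a Secret manifest with base64-encoded values in `data`."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


manifest = secret_manifest("db-credentials", {"password": "s3cr3t"})
encoded = manifest["data"]["password"]

print(encoded)  # not human-readable at a glance
print(base64.b64decode(encoded).decode("utf-8"))  # trivially reversible
```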

Changing persistent data

Sometimes, you need to change the data that your application uses. For instance, you might need to change the schema of a relational database or modify data in a database. Those changes are hard to achieve safely because they can be very difficult, or even impossible, to roll back. This complex topic is the subject of extensive research, and a complete discussion isn't in the scope of this article. The management of those changes isn't fundamentally different on Kubernetes than on other platforms.

For more information on how to change your database system, read Data migration strategies. For a good overview of database schema changes, see Evolutionary Database Design. For an extensive list of references on database schema changes, see Schema evolution.

Database management isn't associated with a Kubernetes object, but some objects are still of interest in this context:

  • If you want to run a relational database in Kubernetes, we recommend that you use the StatefulSet object. Running a database in Kubernetes is an advanced use of Kubernetes, and thus we don't recommend that people new to Kubernetes attempt this before gaining some experience. For more information, see To run or not to run a database on Kubernetes: What to consider.
  • Modifying the schema of a relational database is a good use case for Kubernetes Jobs. Jobs run a given Pod until successful completion, and they have retry mechanisms.
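A schema migration Job could look like the following sketch, written as a Python dictionary and printed as JSON. The Job name, image, and command are hypothetical. The `backoffLimit` and `restartPolicy` fields are the standard Job settings that control retries.

```python
import json

migration_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "schema-migration-0042"},
    "spec": {
        # Mark the Job as failed after three retries.
        "backoffLimit": 3,
        "template": {
            "spec": {
                # Don't restart the container in place; the Job controller
                # creates a new Pod for each retry instead.
                "restartPolicy": "Never",
                "containers": [
                    {
                        "name": "migrate",
                        "image": "registry.example.com/migrations:0042",
                        "command": ["./migrate", "up"],
                    }
                ],
            }
        },
    },
}

print(json.dumps(migration_job, indent=2))
```

Because a Job can retry, the migration itself should be safe to run more than once.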

Changing the infrastructure

An application runs on infrastructure consisting of compute, networking, and storage resources. This infrastructure includes any managed services used by your application. Your choice of single-tenant or multi-tenant infrastructure affects how you can change the infrastructure. The more platform users of a specific infrastructure there are, the more coordination is needed to proceed with any change. Because infrastructure is tightly coupled to the concept of environment, testing infrastructure changes is challenging, as differences tend to exist between the infrastructures of different environments. Because of the tight coupling between infrastructure and environments, we explore these topics in more depth in the Testing environments for infrastructure section.

In the context of Kubernetes, infrastructure can mean two things: the Kubernetes cluster itself (it's providing compute, networking, and storage resources), or specific objects inside Kubernetes (APIs for those resources).

The Kubernetes cluster itself is a piece of infrastructure that you configure, update, and upgrade. We recommend using infrastructure as code (IaC) to manage a cluster, configure it, update it, and upgrade it. Kubernetes provides many primitives to help you safely roll out those changes.

Other Kubernetes objects can be considered as infrastructure themselves:

  • Services and Ingresses are respectively L4 and L7 load balancers in Kubernetes. Additionally, Services are the service discovery mechanism of Kubernetes.
  • NetworkPolicies are a software-defined firewall in Kubernetes.
  • PodSecurityPolicies and ResourceQuotas are limitations on the objects you can create in the cluster.
  • All the RBAC resources are used to control who can do what within the cluster.
  • If you use Istio, its security and traffic routing features are also part of your infrastructure.

Specifically for storage, Kubernetes has several objects that you need to manipulate, such as PersistentVolumes and PersistentVolumeClaims.

In Kubernetes, you can change your storage in two ways. If you're using a compatible storage provider (such as Compute Engine persistent disks), you can resize PersistentVolumes. If you're running a distributed database or storage system, then the preferred method is a rolling recreation of the nodes of your system (with an updated PersistentVolumeClaim).

Some solutions such as Rook provide a Kubernetes-native interface to complex distributed storage systems.

Organizational model

How you approach and implement software delivery for every one of the five preceding categories depends on how your organization is structured, for example, what teams exist, what skills are in your organization, how those skills are spread across teams and people, what your governance model is, and whether you are running on a shared infrastructure.

Single-tenant versus multi-tenant clusters

A single-tenant Kubernetes cluster is a cluster where a single team is deploying applications. In other words, anyone deploying an application in this cluster has significant knowledge about all the other applications running in this cluster. A multi-tenant cluster can be used by multiple teams who aren't necessarily aware of everything else that's running in the cluster.

Whether you choose to use single- or multi-tenant clusters has significant security implications. The security boundary between two clusters, especially in different projects, is much stronger than between different workloads within the same cluster, even if they're running in different namespaces. While GKE has many security features (network policies, pod security policies, RBAC, Workload Identity, GKE Sandbox), none provides stronger isolation between workloads than using two different clusters.

Even though multi-tenant clusters offer significant advantages (cost optimization, easier governance), one disadvantage is that a single problem can have a large impact (multiple teams and applications are affected).

The choice of running a multi-tenant cluster or multiple single-tenant clusters affects your deployment strategy, as the next sections explain in detail.

CI/CD tooling: Centralized or distributed?

You can organize your CI/CD tooling in three ways, depending on what team is in charge of what part of the toolchain. In the following models, we use four team names:

  • Application team: a team in charge of developing one or several applications.
  • Release team: a team of CI/CD specialists, usually with a deep operations background.
  • Infrastructure team: a team in charge of the infrastructure (the Kubernetes clusters).
  • Platform team: a team that has both the functions of a release team and of an infrastructure team.

Every organization has its own model. The three models described in this section are only broad categories into which most organizations fall.

As you consider these models, remember that security is a very important aspect of your CI/CD tooling. Because this tooling is the gateway to all your environments, it's a powerful attack vector. If you're managing your own CI/CD tools, it's important to properly maintain and keep them up to date. We recommend using Identity-Aware Proxy to help secure access to those tools. ACLs within those tools are also important, especially if they're shared by multiple teams. However, how you implement security measures depends on the tools you choose.

Fully distributed model

In this model, every application team is free to choose what tooling and methodologies work best for them, leaving these decisions to the people who are impacted by them.

Even large organizations that adopt this model still need to invest in a dedicated release team, because you can't expect every application team to have CI/CD expertise. In this model, the release team serves as an internal consultant. In some organizations that use this model, the release team also provides a single-tenant, prepackaged CI/CD toolchain to the application teams. In that case, the teams are free to use this service or not, but they receive no support if they don't use it.

We don't recommend you use this model if you have a multi-tenant cluster; the risk of one team impacting another is too great.

From a security perspective, this model increases the risk of having CI/CD tools that nobody maintains or keeps up to date. On the other hand, as a single team uses each tool, the impact of any misconfiguration or attack on a CI/CD tool is greatly reduced.

The following diagram represents the fully distributed model where application teams manage and use both their CI and CD tools while receiving help from a specialized release team.

Diagram that represents the fully distributed model of the continuous integration continuous deployment toolchain.

Fully centralized model

In this model, a central release team provides a multi-tenant CI/CD toolchain with pipeline templates. The release team maintains and updates the CI/CD tools, and it develops and pushes CI/CD best practices. That team either runs one or more Kubernetes clusters themselves (in which case they're a platform team) or works closely with the infrastructure team that does.

In this model, it's almost as if the release and infrastructure teams manage a platform as a service (PaaS) on which the application teams run.

While many organizations aspire to adopt this model, it presents some significant challenges. The release team must either provide CI templates for every technology that the application teams use, or limit the technologies that can be used.

With the fully centralized model, if the organization uses multi-tenant clusters, it often uses multi-tenant CI/CD tooling. On the other hand, if the release team uses single-tenant clusters, it often maintains an instance of the CI/CD toolchain for each application team.

From a security perspective, having a team that owns CI/CD tooling increases the chance that those tools will be well maintained and secured. If you use multi-tenant CI/CD tooling, then the ACL system of those tools becomes very important because you want to tightly control who can deploy what to what environment.
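The kind of control described here ("who can deploy what to what environment") can be sketched as a simple lookup. This Python example is purely illustrative: the `PERMISSIONS` table, the team and application names, and the `can_deploy` function are hypothetical, and real CI/CD tools each have their own ACL model.

```python
# Maps (team, application) to the environments that team may deploy to.
PERMISSIONS = {
    ("team-payments", "payments-api"): {"staging", "production"},
    ("team-search", "search-api"): {"staging"},
}


def can_deploy(team: str, application: str, environment: str) -> bool:
    """Deny by default: only explicitly listed combinations are allowed."""
    return environment in PERMISSIONS.get((team, application), set())


print(can_deploy("team-payments", "payments-api", "production"))
print(can_deploy("team-search", "search-api", "production"))
print(can_deploy("team-search", "payments-api", "staging"))
```

The deny-by-default lookup matters most in a multi-tenant setup, where a missing entry should never grant access to another team's application.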

The following diagram represents the fully centralized model where a specialized release team manages both the CI and CD tools that the application teams use.

Diagram that represents the fully centralized model of the continuous integration continuous deployment toolchain.

Semi-distributed model

This model is a mix of the two preceding models. In organizations that adopt this model, application teams own CI while the release team owns CD. This model lets application teams choose what CI tooling and processes to use, and it leaves the reliability of the production systems to a team that specializes in that work. We still recommend that the application teams use a CI tool managed by the release team, even if they're free to build their own pipelines. This approach ensures that someone is responsible for maintaining the CI tool.

In this model, the application teams build and package the applications and the release team deploys them. Therefore, these teams need to agree on a specification for the application packaging format. As a universal packaging technology, containers have made this fairly easy for all technologies. Even more than in the other models, all teams should follow best practices for building containers and best practices for operating containers.

In this model, if the organization uses multi-tenant clusters, it also often uses a multi-tenant deployment tool. As in the fully centralized model, if a release team uses single-tenant clusters, it often maintains an instance of the CI/CD toolchain for each application team.

The following diagram represents the semi-distributed model where the release team manages the CD tooling and provides prepackaged CI tooling for the application teams to use.

Diagram that represents the semi-distributed model of the continuous integration continuous deployment toolchain.

Evolution from one model to another

Organizations often start with the fully distributed model because it's one that often develops naturally as an organization grows. At a certain point, almost all organizations look to move either to the fully centralized or to the semi-distributed model for governance, reliability, and cost reasons. In any case, moving away from a fully distributed model is a large-scale project that requires significant buy-in from all teams. Some application teams are happy to not have to manage CI/CD, while others (especially if they have significant experience managing CI/CD) can be reluctant to give up that responsibility. The particulars of this process are highly dependent on your organization, but a good general principle applies here: make hard what you want people to stop doing, and make easy what you want people to start or keep doing.

For example, adopting Site Reliability Engineering (SRE) falls into the second category. SRE is a good practice in any IT organization, but it's especially useful when you move from one organization model to another. If you make the move to a more centralized model a prerequisite for getting the support of an SRE team, application teams have significant motivation to make that move.

Environments

Almost every organization in the world, and certainly every large company, has multiple environments for its applications. While the name and number of those environments can vary from one organization to another, the business goal is always the same: separate the production environment, with which users and customers interact, from non-production environments, which you use for testing changes before deploying them to production.

Environments usually fall into one of the following four categories:

  • Solo developer environment. This environment is most commonly hosted on the developer's workstation, or in a virtual machine dedicated to a single developer. Developers are usually free to do whatever they want in this environment. Many companies provide pre-packaged environments for local development to help on-board new employees faster and help developers collaborate more easily (as they have similar testing environments).

    In the case of Kubernetes, you can use kind, Minikube, Docker Desktop, or even small GKE clusters.

  • Shared development environments. In shared environments, multiple developers can test their changes against other parts of the system. While it might be impractical to have a large database in a local development environment, a shared development environment can have such a database. This environment offers a high level of freedom for developers because they can usually change a large number of parameters to test things out. A challenge with shared development environments is their maintenance. Because developers can change a lot of things, these environments tend to degrade fairly quickly. You must automate the cleanup or recreation of these environments to ensure long-term stability.

    In the case of Kubernetes, a shared development environment is usually a single, medium-sized GKE cluster.

  • Staging environments. There is no commonly accepted definition of a staging environment. For our purposes, we define staging environments as test environments meant to replicate production environments as closely as possible. You can use staging environments to perform large-scale tests of changes (such as QA or load testing) before you push them to production. You can also use staging environments for integration testing of multi-service applications.

    In the case of Kubernetes, the staging environment is often a GKE cluster that is almost identical to the production cluster. Some companies, for cost reasons, choose to run a smaller-scale version of the production cluster, use preemptible VMs, or automatically shut down non-production environments outside of business hours.

  • Production environment. This is the environment your users and customers use. It's important to realize that no amount of testing in the other environments can reliably predict the behavior of your application when users are using it in production. We recommend that you plan for this and ensure that you can debug and observe your application while it's running in production. Because the production environment is the best place to see how your application performs, some organizations choose to not have any staging environment and directly push to production. This approach requires advanced deployment methods to ensure that any problem can be detected and remediated as quickly as possible and cannot impact too many users.

Dynamic environments

As teams reach a certain maturity level, many want to implement dynamic environments. Dynamic environments are short-lived environments that you create to test a specific change by a developer. Such environments are ideal for testing changes independently from any other in-flight changes that might be undergoing testing at the same time. However, in a microservice application, managing dynamic environments is fairly complex. Each microservice probably relies on a number of others to function properly. Deploying all the microservices for every dynamic environment quickly becomes impractical as the number of microservices grows.

One way to approach this challenge is the following. First, you must ensure that you're following microservices best practices as described in Migrating a monolithic application to microservices on GKE. Then, you deploy a new version of a microservice in parallel to the existing one. You can either test this new version directly, or use Istio to route only specific traffic to it. In effect, this approach lets you test specific changes for multiple microservices in parallel, in the same environment, as the following diagram illustrates.

Diagram that illustrates the parallel testing of microservices.
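
The routing decision described above can be sketched in a few lines of Python. This is not Istio configuration. It only models the choice that a traffic rule makes for each request, under the assumption that testers opt in by setting a request header. The header name x-test-version and the service names are hypothetical.

```python
from typing import Mapping

# Hypothetical header that a tester sets to opt in to a version under test.
TEST_HEADER = "x-test-version"


def pick_backend(headers: Mapping[str, str], stable: str,
                 candidates: Mapping[str, str]) -> str:
    """Return the backend that should serve a request.

    Requests that carry TEST_HEADER with a known candidate name go to that
    candidate. Every other request goes to the stable version.
    """
    wanted = headers.get(TEST_HEADER)
    if wanted is not None and wanted in candidates:
        return candidates[wanted]
    return stable


# Hypothetical example: one change under test for a "cart" microservice.
candidates = {"pr-123": "cart-pr-123"}
print(pick_backend({}, "cart-stable", candidates))
print(pick_backend({TEST_HEADER: "pr-123"}, "cart-stable", candidates))
```

Because untagged requests always fall through to the stable version, several changes can be tested in the same environment without affecting each other or regular traffic.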

Testing environments for infrastructure

The various environments described in the preceding sections are all used to test changes from the first four categories of changes (application, configuration, secret, and persistent data) mentioned earlier in the Deployment model section. Infrastructure changes, because of their nature, are usually handled differently from those types of changes. We recommend that you use IaC to handle infrastructure changes. Keep in mind that IaC is a relatively recent paradigm and still an area that many are experimenting with. Testing infrastructure changes is not a trivial task. An easy way to do a functional test of an infrastructure is to deploy your application on top of it and test the application. However, this isn't enough for testing against compliance and security rules, for example. In this case, we recommend the use of a dedicated infrastructure testing framework (such as Chef's InSpec, or Terratest for Terraform).

You can handle infrastructure changes in three main ways:

  • You can use existing environments (such as development and staging) to test infrastructure changes. With this approach, you don't have to manage additional environments just for testing infrastructure changes, and your developers can quickly detect most functional problems that your changes might introduce. However, you might end up with development and staging environments that are significantly different from production for a long period of time.
  • You can use environments dedicated to infrastructure testing. Developers don't interact with those environments, and you're able to run all the tests that you want. Because you control 100% of the stack in those environments, you're able to quickly make changes and correct problems. However, this approach requires additional infrastructure. You also still need to modify the development and staging environments before you push your changes to production (but the time during which these environments are different from production is significantly shorter than in the first case).
  • You can use release channels for infrastructure. As with GKE release channels, you can let teams choose which version of the infrastructure they want. This lets you test infrastructure changes in less critical environments (non-production environments, non-critical applications) before pushing them to more sensitive environments. Importantly, the users of the infrastructure that you provide can judge how critical a specific environment is.

GitOps concepts

GitOps is an increasingly popular deployment methodology in the Kubernetes community. This section introduces the concept and discusses some important tradeoffs to consider when you design your CI/CD pipeline for Kubernetes-hosted applications.

Google engineers have been storing configuration and deployment files in our primary source code repository for a long time. The book Site Reliability Engineering, Chapter 8 describes this methodology, which Kelsey Hightower demonstrated during his Google Cloud Next '17 keynote. You might consider using Anthos Config Management, which is a GitOps-like product for managing Kubernetes resources through a Git repository.

The term GitOps itself was coined by Weaveworks. When you practice GitOps, all your environments and deployments are described by text files (for example, Kubernetes manifests) stored in a Git repository. Your environments read their configurations from a specific branch of this Git repository. Any new commit on that branch results in a change in the environment. The main advantage of GitOps is that it lets you use well-understood and powerful Git workflows:

  • The Git history represents the entire history of your infrastructure.
  • You can roll back to any previous state by using Git commands.
  • You can implement manual and automated reviews and validations by using code review workflows such as pull requests.

During its lifetime, your Kubernetes cluster has different objects, for example, Deployments, StatefulSets, and ConfigMaps. The complete set of those objects at a specific point in time is the state of your Kubernetes cluster.

By design, Kubernetes gives very few options to control the transition from one state to another. Two of those options are the maxSurge and maxUnavailable parameters for a Deployment. However, if you want to implement more advanced deployment strategies, you might want greater control over the transitions from one state to another.
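
To make the effect of these two parameters concrete, the following Python sketch computes the bounds that they place on a rolling update. It assumes the documented Kubernetes behavior for percentage values (maxSurge is rounded up, maxUnavailable is rounded down, and both default to 25%). The function names are ours, and the sketch skips the validation that Kubernetes performs, such as rejecting a configuration where both values are zero.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute number or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (max_total_pods, min_available_pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


print(rollout_bounds(10))        # (13, 8)
print(rollout_bounds(4, 1, 0))   # (5, 4)
```

These bounds are all that you control: you can limit how many Pods are added or removed at a time, but not the order or the content of the intermediate states.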

The following diagram illustrates a blue/green deployment, which requires a series of different states. You deploy your new version (green) in parallel to the existing one (blue), and then you switch the traffic from blue to green.

Diagram that illustrates the blue-green deployment method.

As the preceding diagram shows, states 2 and 3 are different from states 1 and 4. States 1 and 4 are nominal states that are meant to be stable over a long period of time. States 2 and 3 are transition states that aren't meant to be the nominal state of your system. When you implement advanced deployment patterns that require those transition states, whether you reflect those transition states in your Git repository affects the GitOps implementation that you need. We explore those two patterns later in this article.
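
As an illustration, the four states can be written down as data. The following Python sketch assumes that state 2 is "both versions deployed, traffic still on blue" and that state 3 is "both versions deployed, traffic on green", which matches the description of blue/green deployments above. The ClusterState type is hypothetical and is not a Kubernetes object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    deployed: frozenset  # versions running in the cluster
    live: str            # version that receives user traffic

    @property
    def nominal(self) -> bool:
        # A state is nominal when only the live version is deployed.
        return self.deployed == frozenset({self.live})


def blue_green_states(old: str, new: str) -> list:
    """Return the sequence of states for a blue/green deployment."""
    both = frozenset({old, new})
    return [
        ClusterState(frozenset({old}), old),  # state 1: nominal
        ClusterState(both, old),              # state 2: transition
        ClusterState(both, new),              # state 3: transition
        ClusterState(frozenset({new}), new),  # state 4: nominal
    ]


for number, state in enumerate(blue_green_states("blue", "green"), start=1):
    print(number, sorted(state.deployed), state.live, state.nominal)
```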

In both patterns, the Git repository is updated first, and then the cluster is updated to reflect the content of the repository. If the update on the cluster fails, a desynchronization between your repository and your cluster occurs, which you must then resolve. We recommend that you set up your systems to send alerts in such a case. If the update fails because the new state declared in Git isn't valid, then we recommend that you mark that state as invalid so that you don't accidentally roll back to it. In other words, in both patterns, the Git repository represents the states that the cluster has attempted to reach, not the states that it has achieved. You could design a system where the Git repository is updated after the cluster, to reflect only the states that were actually achieved. However, this approach isn't the industry standard for GitOps. In such a design, Git could not be the place where you operate your cluster (because you modify the cluster before the repository).
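
A desynchronization check can be as simple as comparing the state declared in Git with the state read from the cluster. The following Python sketch is a minimal illustration under that assumption. It represents each state as a dictionary from an object name to a comparable value (for example, an image tag or a manifest hash). The find_drift function and the sample object names are hypothetical, and reading the real states from Git and from the cluster is out of scope.

```python
def find_drift(desired: dict, actual: dict) -> dict:
    """Compare the state declared in Git with the state read from the cluster.

    Returns a dictionary that maps each differing object name to a
    (desired_value, actual_value) pair. A missing object is reported as None.
    """
    drift = {}
    for name in sorted(desired.keys() | actual.keys()):
        want, have = desired.get(name), actual.get(name)
        if want != have:
            drift[name] = (want, have)
    return drift


desired = {"Deployment/web": "v2", "ConfigMap/web": "c1"}
actual = {"Deployment/web": "v1", "ConfigMap/web": "c1"}

drift = find_drift(desired, actual)
if drift:
    # In a real system, this is where you would send an alert.
    print("Repository and cluster are desynchronized:", drift)
```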

Pattern 1: Only nominal states in the Git repository

In this pattern, only the nominal states of your cluster are in your Git repository. You trigger the deployment by pushing the new nominal state to your repository. This push triggers a complex deployment pipeline in an orchestrator that takes your cluster through the transition states and to the newly declared nominal state, as the following diagram shows. In this scenario, only human operators write to the Git repository.

Diagram that shows the nominal states of a cluster in the repository for a blue-green deployment.

The preceding diagram shows a timeline of the cluster state and the repository state for our earlier blue/green example. The repository and cluster states are desynchronized after you trigger the deployment by pushing the new nominal state (state 4) to the repository. They are resynchronized only when the orchestrator "catches up" by taking the cluster through all the transition states (states 2 and 3).
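
The timeline can be simulated with a short Python sketch. It is an illustration only: the repository and the cluster are modeled as lists of states, a state is a (deployed versions, live version) pair, and the function name is hypothetical. The transition states are the ones assumed for the blue/green example.

```python
def deploy_nominal_only(repo: list, cluster: list, new: str) -> list:
    """Simulate Pattern 1 for a blue/green rollout.

    Returns a list of booleans that records whether the repository and the
    cluster agreed after each step.
    """
    old = cluster[-1][1]
    both = frozenset({old, new})
    target = (frozenset({new}), new)

    repo.append(target)                  # a human pushes the new nominal state
    in_sync = [repo[-1] == cluster[-1]]  # the cluster has not moved yet

    # The orchestrator walks the cluster through the transition states.
    for state in [(both, old), (both, new), target]:
        cluster.append(state)
        in_sync.append(repo[-1] == cluster[-1])
    return in_sync


blue = (frozenset({"blue"}), "blue")
repo, cluster = [blue], [blue]
print(deploy_nominal_only(repo, cluster, "green"))  # [False, False, False, True]
print(len(repo), len(cluster))                      # 2 4
```

The repository gains a single entry while the cluster goes through three changes, and the two agree again only at the end.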

You can implement this pattern by using an orchestrator external to the cluster, such as Google Cloud Build, Spinnaker (which comes with many of those advanced deployment patterns built in), or Jenkins.

Alternatively, you can use an Operator. The Operator pattern was pioneered by CoreOS. An Operator is a controller process that runs inside your cluster and automates the management of your application. Operators are available for a number of well-known applications, or you can create your own based on your needs. Operators have the advantage of being portable, so you can easily reuse one written by someone else. Orchestrators, in contrast, give you more control over the update process, and often better visibility into it.

Advantages of this pattern:

  • The deployment is triggered by modifying the Git repository, which is aligned with the commonly accepted GitOps definition.
  • The Git history remains useful for humans because it doesn't include transition states, which can be numerous and potentially confusing.

Disadvantages of this pattern:

  • The Git repository doesn't reflect all the states that the cluster goes through.
  • The repository and the cluster are desynchronized while the deployment is in progress.

Pattern 2: Nominal and transitional states in the Git repository

In this pattern, all the states (nominal and transitional) of your cluster are in your Git repository. In this pattern, you can't trigger the deployment by pushing the new nominal state to the repository because you want transition states to be in the repository between the old and new nominal states. Therefore, you need to trigger the deployment by interacting with the orchestrator directly, and the orchestrator updates both the cluster and the repository. In this pattern, only the orchestrator writes to the repository.

The following diagram shows a timeline of the cluster state and the repository state for our earlier blue/green example. The two experience only short desynchronizations. Each transition from one state to another is handled similarly to the others.

Diagram that shows the nominal and transitional states of your clusters in the repository for a blue-green deployment.

For this pattern, we recommend using an external orchestrator and not an Operator. Orchestrators typically have a built-in history of the jobs they have executed. This is useful if you need to audit what happened at a specific moment in the past, and it lets you link a Git commit to a job in the orchestrator. Operators usually don't have a built-in history of actions. If an Operator succeeded in updating the cluster but not the repository, you would have a desynchronization that would be very difficult to debug.

Advantages of this pattern:

  • The Git repository represents all the cluster states.
  • The repository and the cluster only experience short desynchronizations.
  • This pattern is modular, and each transition from one state to another is handled similarly to the others. The only unique part for each transition is the transformation of the state definition. The code to push the new state to the repository, and to apply it to the cluster, is exactly the same for each transition and can be factored out.
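
The modularity described in the last point can be sketched in Python. In this illustration, the only transition-specific code is a list of small functions that transform one state into the next. The step that records the state in Git and applies it to the cluster is shared. The repository and the cluster are modeled as lists, the state format is hypothetical, and real commit and apply calls are replaced by list appends.

```python
from typing import Callable, List

State = dict  # for example {"deployed": ["blue"], "live": "blue"}
Transform = Callable[[State], State]


def run_transition(transform: Transform, repo: List[State],
                   cluster: List[State]) -> None:
    """Shared step: compute the next state, record it in Git, apply it."""
    next_state = transform(repo[-1])
    repo.append(next_state)     # stand-in for "commit and push"
    cluster.append(next_state)  # stand-in for "apply to the cluster"


def blue_green(new: str) -> List[Transform]:
    """The only transition-specific code: one transform per state change."""
    return [
        lambda s: {"deployed": s["deployed"] + [new], "live": s["live"]},
        lambda s: {"deployed": s["deployed"], "live": new},
        lambda s: {"deployed": [new], "live": new},
    ]


repo = [{"deployed": ["blue"], "live": "blue"}]
cluster = list(repo)
for transform in blue_green("green"):
    run_transition(transform, repo, cluster)

print(len(repo), repo == cluster)  # 4 True
```

Supporting another deployment strategy in this design means writing a new list of transforms. The run_transition step stays unchanged.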

Disadvantages of this pattern:

  • You don't trigger the deployment by interacting with the Git repository, but with the orchestrator. This approach isn't aligned with the commonly accepted GitOps definition.
  • The Git history can be difficult to understand for a human because all transition states are represented. The states can be numerous and confusing.

Questions raised by GitOps

If you determine that you want to implement GitOps by using one of the two preceding patterns, there are still a number of questions that you need to answer:

  • How many Git repositories are you using? And what are you using them for?
  • How many Git branches in each repository are you using? And what are you using them for?
  • How do you handle multiple target environments?
  • Where do you store your manifest templates?
  • Where do you store your generated manifests?
  • Where do you store your application configuration?
  • How do you deal with a configuration that is shared by multiple applications?
  • How do you handle secrets?
  • How do you control the access to your Git repositories and branches?

Most of these questions have no single right answer. The exact system you choose will depend a lot on the size of your organization, the way your teams interact with each other, and your current technical ecosystem (Git service provider, secret manager, the environments that you have).

GitOps-style continuous delivery with Cloud Build is an example of a GitOps setup using Cloud Source Repositories and Cloud Build that can work for a small team that is working on a relatively simple application. Managing infrastructure as code with Terraform, Cloud Build, and GitOps is another GitOps example, but this time to manage infrastructure, and not an application.

What's next