This document helps you plan and design the optimization phase of your migration to Google Cloud. After you've deployed your workloads in Google Cloud, you can start optimizing your environment.
This document is part of a series:
- Migration to Google Cloud: Getting started
- Migration to Google Cloud: Assessing and discovering your workloads
- Migration to Google Cloud: Building your foundation
- Migration to Google Cloud: Transferring your large datasets
- Migration to Google Cloud: Deploying your workloads
- Migration to Google Cloud: Migrating from manual deployments to automated, containerized deployments
- Migration to Google Cloud: Optimizing your environment (this document)
The following diagram illustrates the path of your migration journey.
In the optimization phase, you refine your environment to make it more efficient than your initial deployment.
This document is useful if you're planning to optimize an existing environment after migrating to Google Cloud, or if you're evaluating the opportunity to optimize and want to explore what it might look like.
The structure of the optimization phase follows the migration framework described in this series: assess, plan, deploy, and optimize. You can use this versatile framework to plan your entire migration and to break down independent actions in each phase. When you've completed the last step of the optimization phase, you can start this phase over and find new targets for optimization. The optimization phase is defined as an optimization loop. An execution of the loop is defined as an optimization iteration.
Optimization is an ongoing and continuous task. You constantly optimize your environment as it evolves. To avoid uncontrolled and duplicative efforts, you can set measurable optimization goals and stop when you meet these goals. After that, you can always set new and more ambitious goals, but consider that optimization has a cost, in terms of resources, time, effort, and skills.
The following diagram shows the optimization loop.
For a larger image of this diagram, see Optimization decision tree.
In this document, you perform the following repeatable steps of the optimization loop:
- Assess your environment, teams, and the optimization loop that you're following.
- Establish optimization requirements and goals.
- Optimize your environment and train your teams.
- Tune the optimization loop.
This document discusses some of the site reliability engineering (SRE) principles and concepts. Google developed the SRE discipline to efficiently and reliably run a global infrastructure serving billions of users. Adopting the complete SRE discipline in your organization might be impractical if you need to modify many of your business and collaboration processes. It might be simpler to apply a subset of the SRE discipline that best suits your organization.
Assessing your environment, teams, and optimization loop
Before starting any optimization task, you need to evaluate your environment. You also need to assess your teams' skills because optimizing your environment might require skills that your teams lack. Finally, you need to assess the optimization loop. The loop is a resource that you can optimize like any other resource.
Assess your environment
You need a deep understanding of your environment. For any successful optimization, you need to understand how your environment works and you need to identify potential areas of improvement. This assessment establishes a baseline against which you can compare the results of this optimization iteration and of the iterations that follow.
Migration to Google Cloud: Assessing and discovering your workloads contains extensive guidance about assessing your workloads and assessing your environments. If you recently completed a migration to Google Cloud, you already have detailed information on how your environment is configured, managed, and maintained. Otherwise, use that guidance to assess your environment.
Assess your teams
When you have a clear understanding of your environment, assess your teams to understand their skills. You start by listing all skills, the level of expertise for each skill, and which team members are the most knowledgeable for each skill. Use this assessment in the next phase to discover any missing skills that you need to meet your optimization goals. For example, if you start using a managed service, you need the skills to provision, configure, and interact with that service. If you want to add a caching layer to an application in your environment by using Memorystore, you need expertise to use that service.
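A skills inventory of this kind can be kept as simple structured data. The following is a minimal sketch, not a prescribed format: the team member names, the skill list, and the 0 to 3 expertise scale are all hypothetical.

```python
# Hypothetical skills inventory: skill -> {team member: expertise level (0-3)}.
skills = {
    "Compute Engine": {"alice": 3, "bob": 2},
    "GKE": {"alice": 1, "carol": 2},
    "Memorystore": {},  # nobody on the team has used this service yet
}


def most_knowledgeable(skill):
    """Return the member with the highest level for a skill, or None."""
    levels = skills.get(skill, {})
    if not levels:
        return None
    return max(levels, key=levels.get)


def missing_skills(required, minimum_level=2):
    """Return the required skills for which no member reaches minimum_level."""
    return [
        skill
        for skill in required
        if max(skills.get(skill, {}).values(), default=0) < minimum_level
    ]


print(most_knowledgeable("Compute Engine"))
print(missing_skills(["GKE", "Memorystore"]))
```

In this sketch, a goal that depends on Memorystore would surface as a missing skill, which is the kind of gap that a training program can later address.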
Take into account that optimizing your environment might impact your business and collaboration processes. For example, if you start using a fully managed service instead of a self-managed one, you can give your operators more time to eliminate toil.
Assess your optimization loop
The optimization loop is a resource that you can optimize too. Use the data gathered in this assessment to gain clear insights into how your teams performed during the last optimization iteration. For example, if you aim to shorten the iteration duration, you need data about your last iteration, including its complexity and the goals you were pursuing. You also need information about all blockers that you encountered during the last iteration to ensure that you have a mitigation strategy if those blockers reoccur.
If this is the first optimization iteration, you might not have enough data to establish a baseline against which to compare your performance. Draft a set of hypotheses about how you expect your teams to perform during the first iteration. After the first optimization iteration, evaluate the loop and your teams' performance and compare them against the hypotheses.
Establishing your optimization requirements and goals
Before starting any optimization task, draft a set of clearly measurable goals for the iteration.
In this step, you perform the following activities:
- Define your optimization requirements.
- Set measurable optimization goals according to your optimization requirements.
Define your optimization requirements
You list your requirements for the optimization phase. A requirement expresses a need for improvement and doesn't necessarily have to be measurable.
Starting from a set of quality characteristics for your workloads, your environment, and your own optimization loop, you can draft a questionnaire to guide you in setting your requirements. The questionnaire covers the characteristics that you find valuable for your environment, processes, and workloads.
There are many sources to guide you in defining the quality characteristics. For example, the ISO/IEC 25010 standard defines the quality characteristics for a software product, or you can review the enterprise onboarding checklist.
For example, the questionnaire can ask the following questions:
- Can your infrastructure and its components scale vertically or horizontally?
- Does your infrastructure support rolling back changes without manual intervention?
- Do you already have a monitoring system that covers your infrastructure and your workloads?
- Do you have an incident management system for your infrastructure?
- How much time and effort does it take to implement the planned optimizations?
- Were you able to meet all goals in your past iterations?
Starting from the answers to the questionnaire, you draft the list of requirements for this optimization iteration. For example, your requirements might be the following:
- Increase the performance of an application.
- Increase the availability of a component of your environment.
- Increase the reliability of a component of your environment.
- Reduce the operational costs of your environment.
- Shorten the duration of the optimization iteration to reduce the inherent risks.
- Increase development velocity and reduce time-to-market.
When you have the list of improvement areas, evaluate the requirements in the list. In this evaluation, you analyze your optimization requirements, look for conflicts, and prioritize the requirements in the list. For example, increasing the performance of an application might conflict with operational cost reduction.
Set measurable goals
After you finalize the list of requirements, define measurable goals for each requirement. A goal might contribute to more than one requirement. If you have any area of uncertainty or if you're not able to define all goals that you need to cover your requirements, go back to the assessment phase of this iteration to gather any missing information, and then refine your requirements.
For help defining these goals, you can follow one of the SRE practices, which is to define service level indicators (SLIs) and service level objectives (SLOs):
- SLIs are quantitative measures of the level of service that you provide. For example, a key SLI might be the average request latency, error rate, or system throughput.
- SLOs are target values or ranges of values for a service level that is measured by an SLI. For example, an SLO might be that the average request latency is lower than 100 milliseconds.
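The relationship between an SLI and an SLO can be sketched in a few lines of code. The example below uses the average request latency SLI and the 100 millisecond SLO mentioned above; the latency samples are hypothetical.

```python
from statistics import mean

# Hypothetical request latencies, in milliseconds, sampled over one window.
latencies_ms = [82.0, 95.5, 110.0, 78.5, 91.0]

# SLO from the text: average request latency lower than 100 milliseconds.
SLO_MAX_AVG_LATENCY_MS = 100.0


def average_latency_sli(samples):
    """SLI: the average request latency of the samples, in milliseconds."""
    return mean(samples)


def meets_slo(samples, threshold_ms=SLO_MAX_AVG_LATENCY_MS):
    """Return True if the measured SLI is strictly below the SLO threshold."""
    return average_latency_sli(samples) < threshold_ms


sli = average_latency_sli(latencies_ms)
print(f"average latency: {sli:.1f} ms, SLO met: {meets_slo(latencies_ms)}")
```

The SLI is the measurement and the SLO is the target that the measurement is compared against, so the same SLI can serve several SLOs with different thresholds.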
After defining SLIs and SLOs, you might realize that you're not gathering all metrics that you need to measure your SLIs. This metrics collection is the first optimization goal that you can tackle. You set the goals related to extending your monitoring system to gather all metrics that you need for your SLIs.
Optimizing your environment and your teams
After assessing your environment, teams, and optimization loop, as well as establishing requirements and goals for this iteration, you're ready to perform the optimization step.
In this step, you perform the following activities:
- Measure your environment, teams, and optimization loop.
- Analyze the data coming from these measurements.
- Perform the optimization activities.
- Measure and analyze again.
Measure your environment, teams, and optimization loop
You extend your monitoring system to gather data about the behavior of your environment, teams, and the optimization loop to establish a baseline against which you can compare after optimizing.
This activity builds on and extends what you did in the assessment phase. After you establish your requirements and goals, you know which metrics to gather for your measurements to be relevant to your optimization goals. For example, if you defined SLOs and the corresponding SLIs to reduce the response latency for one of the workloads in your environment, you need to gather data to measure that metric.
Understanding these metrics also applies to your teams and to the optimization loop. You can extend your monitoring system to gather data so that you measure the metrics relevant to your teams and the optimization loop. For example, if you have SLOs and SLIs to reduce the duration of the optimization iteration, you need to gather data to measure that metric.
When you design the metrics that you need to extend the monitoring system, take into account that gathering data might affect the performance of your environment and your processes. Evaluate the metrics that you need to implement for your measurements, and their sample intervals, to understand if they might affect performance. For example, a metric with a high sample frequency might degrade performance, so you need to optimize further.
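To get a rough feeling for how the sample interval drives the volume of collected data, you can estimate the number of data points written per day. This is a back-of-the-envelope sketch: the number of metrics and the intervals are hypothetical, and the real overhead also depends on how each sample is collected.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(metric_count, sample_interval_s):
    """Data points written per day for metric_count metrics, each sampled
    once every sample_interval_s seconds."""
    if sample_interval_s <= 0:
        raise ValueError("sample_interval_s must be positive")
    return metric_count * (SECONDS_PER_DAY // sample_interval_s)


# Compare three candidate sample intervals for 50 hypothetical metrics.
for interval_s in (1, 10, 60):
    print(interval_s, points_per_day(metric_count=50, sample_interval_s=interval_s))
```

Moving from a 1-second to a 60-second interval reduces the data volume by a factor of 60, at the price of coarser measurements.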
On Google Cloud, you can use Cloud Monitoring to implement the metrics that you need to gather data. You can also implement custom metrics and gather data from your on-premises or hybrid cloud environments. To implement custom metrics in your workloads directly, you can use Cloud Client Libraries for Cloud Monitoring, OpenCensus, or OpenTelemetry. If you're using Google Kubernetes Engine (GKE), you can use GKE usage metering to gather information about resource usage, such as CPU, GPU, and TPU usage, and then divide resource usage by namespace or label.
Analyze your data
After gathering your data, you analyze and evaluate it to understand how your environment, teams, and optimization loop are performing against your optimization requirements and goals.
In particular, you evaluate your environment against the following:
- Industry best practices.
- An environment without any technical debt.
The SLOs that you established according to your optimization goals can help you understand if you're meeting your expectations. If you're not meeting your SLOs, you need to improve your environment, your teams, or the optimization loop. For example, if you established an SLO for the response latency of a workload at a given percentile and the workload isn't meeting that mark, that is a signal that you need to optimize that part of the workload.
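A percentile-based latency SLO can be checked with a short calculation. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions; the samples and the 200 millisecond target are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response latencies, in milliseconds.
latencies_ms = [120, 80, 95, 300, 110, 90, 105, 85, 100, 115]

# Hypothetical SLO: the 90th percentile latency is lower than 200 ms.
p90 = percentile(latencies_ms, 90)
print(p90, p90 < 200)
```

A percentile is less sensitive to a single outlier than the maximum and more sensitive to slow requests than the average, which is why latency SLOs are often expressed this way.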
Additionally, you can compare your situation against a set of recognized best practices in the industry. For example, the enterprise onboarding checklist helps you configure a production-ready environment for enterprise workloads. Or if you're using GKE, you can check if your environment is production ready and if you're following the best practices for operating containers.
After collecting data, you can consider how to optimize your environment to make it more cost efficient. You can export Cloud Billing data to BigQuery and create a Cloud Billing dashboard with Data Studio to analyze the data, understand how many resources you're using, and extract any spending pattern from it.
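Once cost data is available in tabular form, extracting spending patterns is mostly a matter of grouping and summing. The following sketch works on a handful of hypothetical, simplified records; the field names are assumptions for illustration and do not reproduce the schema of a real Cloud Billing export.

```python
from collections import defaultdict

# Hypothetical, simplified cost records (not the real export schema).
records = [
    {"project": "web-prod", "service": "Compute Engine", "cost": 120.50},
    {"project": "web-prod", "service": "Cloud Storage", "cost": 30.25},
    {"project": "data-dev", "service": "BigQuery", "cost": 75.00},
    {"project": "data-dev", "service": "Compute Engine", "cost": 45.25},
]


def total_cost_by(key, rows):
    """Sum the cost of all rows, grouped by the value of the given field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost"]
    return dict(totals)


by_project = total_cost_by("project", records)
by_service = total_cost_by("service", records)
print(by_project)
print(by_service)
```

The same grouping can be expressed as a query when the data lives in a data warehouse; the point of the sketch is only that breaking costs down along more than one dimension makes patterns easier to spot.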
Finally, you compare your environment to one where you don't have any technical debt, to see whether you're meeting your long-term goals and to see if the technical debt is increasing. For example, you might establish an SLO for how many resources in your environment you're monitoring versus how many resources you have provisioned since the last iteration. If you didn't extend the monitoring system to cover those new resources, your technical debt increased. When analyzing the changes in your technical debt, also consider the factors that led to those changes. For example, a business need might require an increase in technical debt, or the increase might be unexpected. Knowing the factors that caused a change in your technical debt gives you insights for future optimization targets.
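The monitoring coverage example can be expressed as a ratio that you track from one iteration to the next. This is a minimal sketch with hypothetical resource names; how you list provisioned and monitored resources depends on your tooling.

```python
def monitoring_coverage(provisioned, monitored):
    """Fraction of provisioned resources that are also monitored."""
    provisioned = set(provisioned)
    if not provisioned:
        return 1.0  # nothing is provisioned, so nothing is unmonitored
    return len(provisioned & set(monitored)) / len(provisioned)


# Hypothetical state at the end of the previous iteration.
previous = monitoring_coverage(
    provisioned={"vm-1", "vm-2", "db-1", "bucket-1"},
    monitored={"vm-1", "vm-2", "db-1", "bucket-1"},
)

# Hypothetical state now: vm-3 was provisioned but is not monitored yet.
current = monitoring_coverage(
    provisioned={"vm-1", "vm-2", "vm-3", "db-1", "bucket-1"},
    monitored={"vm-1", "vm-2", "db-1", "bucket-1"},
)

debt_increased = current < previous
print(previous, current, debt_increased)
```

A drop in coverage between iterations is a signal that technical debt grew, and the list of unmonitored resources tells you where.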
To monitor your environment on Google Cloud, you can use Monitoring to design charts, dashboards, and alerts. You can then export Cloud Logging data for a more in-depth analysis and extended retention period. For example, you can create aggregated sinks and use Cloud Storage, Pub/Sub, or BigQuery as destinations. If you export data to BigQuery, you can then use Data Studio to visualize data so that you can identify trends and make predictions. You can also use evaluation tools such as Recommender, Security Command Center, and Forseti to automatically analyze your environment and processes, looking for optimization targets.
After you analyze all of the measurement data, you need to answer two questions:
Are you meeting your optimization goals?
If you answered yes, then this optimization iteration is completed, and you can start a new one. If you answered no, you can move to the second question.
Given the resources that you budgeted, can you achieve the optimization goals that you set for this iteration?
To answer this question, consider all resources that you need, such as time, money, and expertise. If you answered yes, you can move to the next section; otherwise, refine your optimization goals, considering the resources you can use for this iteration. For example, if you're constrained by a fixed schedule, you might need to schedule some optimization goals for the next iteration.
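The two questions form a small decision procedure, which can be sketched as follows. The resource names, units, and amounts are hypothetical; the function only encodes the order in which the questions are asked.

```python
def within_budget(required, budgeted):
    """Return True if every required resource fits in the budgeted amount."""
    return all(
        amount <= budgeted.get(name, 0) for name, amount in required.items()
    )


def next_step(goals_met, required, budgeted):
    """Decide what to do after analyzing the measurement data."""
    if goals_met:
        return "start a new iteration"
    if within_budget(required, budgeted):
        return "perform the optimization activities"
    return "refine the optimization goals"


# Hypothetical budget for this iteration.
budgeted = {"days": 20, "engineers": 3}

print(next_step(False, {"days": 15, "engineers": 2}, budgeted))
print(next_step(False, {"days": 30, "engineers": 2}, budgeted))
```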
Optimize your teams
Optimizing the environment is a continuous challenge and can require skills that your teams might lack, which you discovered during the assessment and the analysis. For this reason, optimizing your teams by acquiring new skills and making your processes more efficient is crucial to the success of your optimization activities.
To optimize your teams, you need to do the following:
- Design and implement a training program.
- Optimize your team structure and culture.
For your teams to acquire the skills that they are missing, you need to design and implement a training program or choose one that professional Google Cloud trainers prepared. For more information, see Migration to Google Cloud: Assessing and discovering your workloads.
While optimizing your teams, you might find that there is room to improve structure and culture. It's difficult to prescribe an ideal situation upfront, because every company has its own history and idiosyncrasies that contributed to the evolution of its teams' structure and culture.
DevOps culture: how to transform is a good starting point to learn general frameworks for executing and measuring organizational changes aimed at adopting DevOps practices. The principles and common pitfalls sections can help you avoid future obstacles while optimizing your teams. For practical guidance on how to implement an effective DevOps culture in your organization, refer to Site Reliability Engineering, a comprehensive description of the SRE methodology. The Site Reliability Workbook, the hands-on companion to the book, uses concrete examples to show you how to put SRE principles and practices to work.
SRE suggests implementing a blameless postmortem culture to let your teams learn from their failures. If you're interested in starting your SRE journey to optimize your teams, you can find resources in the SRE books and the Customer Reliability Engineering (CRE) blog. You can also read re:Work, guides, and case studies to research and introduce practices for data-driven process improvements.
Optimize your environment
After measuring and analyzing metrics data, you know which areas you need to optimize.
This section covers general optimization techniques for your Google Cloud environment. You can also perform any optimization activity that's specific to your infrastructure and to the services that you're using.
One of the biggest advantages of adopting a public cloud environment like Google Cloud is that you can use well-defined interfaces such as Cloud APIs to provision, configure, and manage resources. You can define your infrastructure as code (IaC), and you can version it using your version control system of choice.
You can use tools such as Deployment Manager and Terraform to provision, and Ansible, Chef, or Puppet to deploy your Google Cloud resources. An IaC process helps you implement an effective rollback strategy for your optimization tasks. You can revert any change that you applied to the code that describes your infrastructure. Also, you can avoid unexpected failures while updating your infrastructure by testing your changes.
Therefore, if you adopt an IaC process in the early optimization iterations, you can define further optimization activities as code. You can also adopt the process gradually, so you can evaluate if it's suitable to your environment.
To completely optimize your entire environment, you need to use resources efficiently. This means that you need to eliminate toil to save resources and to reinvest in more important tasks that produce value, like optimization activities.
Per the SRE recommendation, the way to eliminate toil is by increasing automation. Not all automation tasks require highly specialized software engineering skills or great effort. Sometimes a short script that runs periodically can save several hours per day. Google Cloud provides tools such as Cloud SDK and managed services such as Cloud APIs, Cloud Scheduler, Cloud Composer, and Cloud Functions that your teams can use to automate repetitive tasks.
For example, you can deploy continuous delivery pipelines with Spinnaker and Google Kubernetes Engine to automate build, test, and deploy tasks using Cloud Build. Or, you can automate complex data-processing workflows by integrating multiple Google Cloud services to produce datasets for your services.
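As an illustration of how small such a script can be, the following sketch flags resources that are older than a retention period. It works on a hypothetical in-memory list; in practice, the list would come from whichever API or command-line tool you use to enumerate resources, and a scheduler would run the script periodically.

```python
from datetime import datetime, timedelta, timezone


def stale_resources(resources, now, max_age):
    """Return the names of resources created more than max_age before now."""
    return [r["name"] for r in resources if now - r["created"] > max_age]


# Hypothetical snapshot inventory and a fixed "now" so the result is stable.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
snapshots = [
    {"name": "snap-a", "created": datetime(2024, 1, 30, tzinfo=timezone.utc)},
    {"name": "snap-b", "created": datetime(2023, 12, 1, tzinfo=timezone.utc)},
]

print(stale_resources(snapshots, now, max_age=timedelta(days=30)))
```

The script only reports candidates. Whether to delete them automatically or to ask for a review is a separate decision that depends on how much risk you accept.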
If you can't gather detailed measures about your environment, you can't improve it, because you lack data to back up your assumptions. This means that you don't know what to do to meet your optimization goals.
A comprehensive monitoring system is a necessary component for your environment. The system monitors all essential metrics that you need to evaluate for your optimization goals. When you design your monitoring system, plan to monitor the four golden signals at minimum.
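The four golden signals described in the SRE book are latency, traffic, errors, and saturation. The sketch below derives them from a few hypothetical request records and a hypothetical CPU reading; a real monitoring system computes them continuously from much richer data.

```python
def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    """Compute the four golden signals for one observation window."""
    count = len(requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        # Latency: average time to serve a request, in milliseconds.
        "latency_ms": sum(r["latency_ms"] for r in requests) / count if count else 0.0,
        # Traffic: requests per second over the window.
        "traffic_rps": count / window_s,
        # Errors: fraction of requests that failed with a server error.
        "error_rate": errors / count if count else 0.0,
        # Saturation: how full the most constrained resource is (CPU here).
        "saturation": cpu_used / cpu_capacity,
    }


# Hypothetical requests observed during a 2-second window.
requests = [
    {"latency_ms": 100, "status": 200},
    {"latency_ms": 200, "status": 200},
    {"latency_ms": 300, "status": 500},
    {"latency_ms": 400, "status": 200},
]

signals = golden_signals(requests, window_s=2, cpu_used=3, cpu_capacity=4)
print(signals)
```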
You can use managed services such as Monitoring and Logging to monitor your environment without having to set up a complicated monitoring solution. You can also extend your monitoring system using Monitoring by integrating with open source applications. For example, you can perform white-box app monitoring for GKE with Prometheus or capture Cloud Bigtable tracing and metrics using OpenCensus.
You might need to implement a monitoring system that can monitor hybrid and multi-cloud environments, either to satisfy data restriction policies that force you to store data only in certain physical locations, or to support services that use multiple cloud environments simultaneously.
Adopt a cloud-native approach
Cloud native is a paradigm that describes an efficient way for designing and running an application on the cloud. The Cloud Native Computing Foundation (CNCF) defines a cloud-native application as an application that is scalable, resilient, manageable, and observable by technologies such as containers, service meshes, microservices, immutable infrastructure, and declarative APIs. Google Cloud provides managed services such as GKE, Cloud Run, Traffic Director, Logging, and Monitoring to empower users to design and run cloud-native applications.
If you have a legacy monolithic application that is difficult to modernize by applying the cloud-native paradigm, you can migrate it to microservices on GKE.
Manage your costs
Because of their different billing and cost models, optimizing the costs of a public cloud environment like Google Cloud is different from optimizing an on-premises environment. This section covers optimization techniques that you can apply to manage your costs on Google Cloud.
First, create a budget in Cloud Billing to track costs, breaking them down by Cloud project or label. You can also configure alerts or trigger follow-up actions through Pub/Sub messages and Cloud Functions to manage your costs.
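The logic behind a budget alert is a comparison between current spend and a set of thresholds. The following sketch shows that comparison in isolation; the budget amount and the 50%, 90%, and 100% thresholds are hypothetical, and the sketch does not call any Cloud Billing API.

```python
def crossed_thresholds(spend, budget, thresholds=(0.5, 0.9, 1.0)):
    """Return the budget thresholds that the current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


# Hypothetical monthly budget of 1000 currency units.
print(crossed_thresholds(spend=400, budget=1000))
print(crossed_thresholds(spend=950, budget=1000))
print(crossed_thresholds(spend=1200, budget=1000))
```

Each crossed threshold could map to a different action, for example a notification at the lower thresholds and an automated response at the highest one.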
If you're using Compute Engine, you can use the virtual machine (VM) instance sizing recommender to optimize the resource utilization of your VMs, based on metrics from Monitoring. Or you can use recommendations for IaC to create automated workflows to reduce costs.
If your workloads tolerate it, you can use preemptible VM instances, which are highly affordable, short-lived VM instances.
Google Cloud services are billed with a customer-friendly pricing model. To further reduce your bill, Google Cloud has features such as committed use discounts, where you purchase a contract in advance in return for discounted prices. Or you can apply sustained use discounts to receive automatic discounts on your Compute Engine bill.
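Whether a commitment pays off depends on how much of the committed capacity you actually use. The sketch below works through that arithmetic under simplifying assumptions: the hourly price and the discount are hypothetical placeholders, not published Google Cloud prices, the commitment is billed for every hour whether the resource is used or not, and other discounts are ignored.

```python
def monthly_costs(hourly_price, discount, utilization, hours=730):
    """Return (on_demand_cost, committed_cost) for one month.

    On demand, you pay only for the hours that you use. With a commitment,
    you pay the discounted price for every hour in the month.
    """
    on_demand = hourly_price * utilization * hours
    committed = hourly_price * (1 - discount) * hours
    return on_demand, committed


def break_even_utilization(discount):
    """Utilization above which the commitment is cheaper than on demand."""
    return 1 - discount


# Hypothetical price of 0.10 per hour and a hypothetical 30% discount.
for utilization in (0.5, 0.9):
    on_demand, committed = monthly_costs(0.10, 0.30, utilization)
    print(utilization, round(on_demand, 2), round(committed, 2))
```

Under these assumptions, the commitment is cheaper only when utilization exceeds one minus the discount, so a workload that runs half of the time would not benefit from a 30% discount.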
For some Google Cloud services, there are discounted plans that reduce your costs and let you plan your budget with a more predictable pricing model, such as flat-rate pricing for BigQuery, and the storage growth plan for Cloud Storage.
Lastly, apply these cost management techniques with automation in mind. Given that you can apply IaC, you can define cost management tasks as code and automate them to reduce costs without any manual interaction.
Measure and analyze again
When you complete the optimization activities for this iteration, you repeat the measurements and the analysis to check if you reached your goals. Answer the following question:
Did you meet your optimization goals?
If you answered yes, you can move to the next section.
If you answered no, go back to the beginning of this phase.
Tuning the optimization loop
In this section, you update and modify the optimization loop that you followed in this iteration to better fit your team structure and environment.
Codify the optimization loop
To optimize the optimization loop efficiently, you need to document and define the loop in a form that is standardized, straightforward, and easy to manage, allowing room for changes. You can use a fully managed service such as Cloud Composer to create, schedule, monitor, and manage your workflows. You can also first represent your processes with a language such as the business process model and notation (BPMN). After that, you can codify these processes with a standardized language such as the business process execution language (BPEL). After you adopt IaC, describing your processes with code lets you manage them as you do the rest of your environment.
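Before adopting a workflow engine or a process language, it can help to see how little is needed to express the loop as code. The following sketch is an illustration only: it models the four steps of the loop as an ordered list and runs one iteration with hypothetical handler functions that pass a state dictionary along.

```python
# The four repeatable steps of the optimization loop, in order.
LOOP_STEPS = ["assess", "establish_goals", "optimize", "tune"]


def run_iteration(handlers, state):
    """Run every step of the loop once, recording the steps as they finish."""
    for step in LOOP_STEPS:
        state = handlers[step](state)
        state.setdefault("completed", []).append(step)
    return state


# Hypothetical handlers. Each one receives the state and returns a new state.
handlers = {
    "assess": lambda s: {**s, "baseline_latency_ms": 120},
    "establish_goals": lambda s: {**s, "target_latency_ms": 100},
    "optimize": lambda s: {**s, "measured_latency_ms": 95},
    "tune": lambda s: {**s, "next_iteration_days": 10},
}

result = run_iteration(handlers, {})
print(result["completed"])
```

Once the loop exists in this form, you can version it, review changes to it, and replace individual handlers with automated tasks without changing the overall structure.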
Automate the optimization loop
After you codify the optimization loop, you can automate repetitive tasks to eliminate toil, save time, and make the optimization loop more efficient. You can start by automating all tasks where a human decision is not required, such as measuring data and producing aggregate reports for your teams to analyze. For example, you can automate data analysis with service monitoring in Cloud Monitoring to check if your environment meets the SLOs that you defined. Given that optimization is a never-ending task and that you iterate on the optimization loop, even small automations can significantly increase efficiency.
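An aggregate SLO report is a typical task that needs no human decision. The sketch below checks a set of measurements against a set of SLO definitions; the SLO names, targets, and measured values are hypothetical, and a managed service can replace this logic entirely.

```python
import operator

# Hypothetical SLO definitions: name -> (comparison, target value).
slos = {
    "avg_latency_ms": (operator.lt, 100.0),
    "error_rate": (operator.lt, 0.01),
    "availability": (operator.ge, 0.999),
}

# Hypothetical measurements collected by the monitoring system.
measurements = {"avg_latency_ms": 91.4, "error_rate": 0.02, "availability": 0.9995}


def slo_report(slos, measurements):
    """Map each SLO name to True if it is met and False otherwise.

    An SLO without a measurement counts as not met, because it can't be
    verified.
    """
    report = {}
    for name, (compare, target) in slos.items():
        value = measurements.get(name)
        report[name] = value is not None and compare(value, target)
    return report


report = slo_report(slos, measurements)
print(report)
print([name for name, met in report.items() if not met])
```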
Monitor the optimization loop
As you did for all the resources in your environment, you need to monitor the optimization loop to verify that it's working as expected and also look for bottlenecks and future optimization goals. You can start monitoring it by tracking how much time and how many resources your teams spent on each optimization step. For example, you can use an issue tracking system and a project management tool to monitor your processes and extract relevant statistics about metrics like issue resolution time and time to completion.
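Statistics such as issue resolution time can be computed from the timestamps that an issue tracking system already records. The following sketch uses a few hypothetical issues, each tagged with the optimization step that it belongs to.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical issues exported from an issue tracking system.
issues = [
    {"step": "assess", "opened": datetime(2024, 1, 1, 9), "closed": datetime(2024, 1, 1, 17)},
    {"step": "optimize", "opened": datetime(2024, 1, 2, 9), "closed": datetime(2024, 1, 3, 9)},
    {"step": "optimize", "opened": datetime(2024, 1, 4, 9), "closed": datetime(2024, 1, 4, 13)},
]


def resolution_hours(issue):
    """Time between opening and closing an issue, in hours."""
    return (issue["closed"] - issue["opened"]).total_seconds() / 3600


def hours_by_step(issues):
    """Total resolution time per optimization step, in hours."""
    totals = {}
    for issue in issues:
        totals[issue["step"]] = totals.get(issue["step"], 0.0) + resolution_hours(issue)
    return totals


hours = [resolution_hours(issue) for issue in issues]
print(mean(hours), median(hours))
print(hours_by_step(issues))
```

Comparing these numbers across iterations shows which step of the loop consumes the most time and where a bottleneck might be forming.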
Finding help
Google Cloud provides the following support resources:
- Self-service resources. If you don't need dedicated support, you have various options that you can use at your own pace.
- Technology partners. Google Cloud has partnered with multiple companies to help you use our products and services.
- Google Cloud professional services. Our professional services can help you get the most out of your investment in Google Cloud.
There are more resources to help migrate workloads to Google Cloud in the Google Cloud Migration Center. For more information about these resources, see the finding help section of Migration to Google Cloud: Getting started.