Anthos Service Mesh provides Anthos users with tools to monitor and manage reliable microservice-based applications. This tutorial uses the Anthos Sample Deployment on Google Cloud to introduce you to some of Anthos Service Mesh's service management features by showing you how to define a service level objective (SLO). The Anthos Sample Deployment deploys a real Anthos hands-on environment with a GKE cluster, service mesh, and a Bank of Anthos application with multiple microservices.
What is an SLO?
According to Google's Site Reliability Engineering (SRE) book:
It's impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they use an internal API or a public product.
Google SRE teams use service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs) to structure and guide the metrics that inform their work. An SLI is a quantitative measure of some aspect of how your service is performing, such as its latency or availability, while an SLO is a target value ("this should happen x% of the time") for a service level that is measured by an SLI. Anthos Service Mesh makes it easy to define and refine SLOs for your own services. It gives you the information that you need to identify appropriate SLIs and SLOs, and notifies you when your service isn't meeting its SLOs.
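To make these definitions concrete, here is a minimal sketch of how a latency SLI and an SLO target relate. This is not part of the tutorial or the Anthos Service Mesh implementation; the function names and the sample latencies are hypothetical, purely for illustration:

```python
# Sketch: a latency SLI (p99 latency) and an SLO check against it.
# Sample data and function names are hypothetical.
from statistics import quantiles

def latency_sli_p99(latencies_ms):
    """SLI: the 99th-percentile request latency, in milliseconds."""
    # quantiles(..., n=100) returns the 1st..99th percentile cut points.
    return quantiles(latencies_ms, n=100)[98]

def meets_latency_slo(latencies_ms, threshold_ms, target=0.99):
    """SLO: at least `target` fraction of requests finish within threshold_ms."""
    within = sum(1 for l in latencies_ms if l <= threshold_ms)
    return within / len(latencies_ms) >= target

latencies = [12, 15, 14, 18, 22, 250, 16, 13, 17, 19]  # one slow outlier
print(meets_latency_slo(latencies, threshold_ms=100, target=0.9))  # True: 9/10 within 100 ms
```

The SLI is the measurement (here, p99 latency); the SLO is the target applied to it ("x% of requests should finish within the threshold").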
In this tutorial, you're introduced to managing services with Anthos Service Mesh in Anthos through the following tasks:
Identify a service level indicator (SLI) for a service.
Use a service level objective (SLO) to monitor for unexpected behavior.
Using the Anthos Sample Deployment will incur pay-as-you-go charges for Anthos on Google Cloud as listed on our Pricing page unless you have an Anthos subscription.
You are also responsible for other Google Cloud costs incurred while running the Anthos Sample Deployment, such as charges for Compute Engine VMs and load balancers. You can see an estimated monthly cost for all these resources on the deployment's Google Cloud Marketplace page.
We recommend cleaning up after finishing the tutorial or exploring the deployment to avoid incurring further charges. The Anthos Sample Deployment is not intended for production use and its components cannot be upgraded.
Before you begin
This tutorial is a follow-up to the Explore Anthos tutorial. Before starting this tutorial, follow the instructions on that page to set up your project and install the Anthos Sample Deployment.
Identifying an SLI
Anthos Service Mesh makes gathering SLIs and defining your SLOs a simple, straightforward task. In this example, you decide to first define an SLO for the Bank of Anthos ledgerwriter service.
First, use Anthos Service Mesh to find information that you can use to identify an SLI for the service.
Go to the Anthos Service Mesh page in the project where you installed the Anthos Sample Deployment.
The top part of this view shows the current status of your application's services along with indicators for alerts and SLOs, including the count of services without SLOs; currently all of the services are under No SLOs set. In addition, in the Status column, all of the services have a black circle indicator. If you hold the pointer over that indicator for any service, you're informed that no SLO is set for the service.
Note the value in ms for 99% latency for ledgerwriter (you might need to scroll down and across to see it). This metric means that 99% of requests to the service complete within this time, so roughly one in every 100 requests takes longer. You will use this value in the next section.
Creating an SLO
Now create an SLO against a latency SLI for the service. To see what happens when a service exceeds its error budget, set a threshold that's deliberately low, based on the information that you saw in the previous section. For a real production service you'd try to find a threshold latency value no lower than is necessary for your users to have a good experience from your application.
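Under the hood, the console settings you're about to choose correspond to a Cloud Monitoring service-level objective. As a rough, unofficial sketch, the same SLO could be described with a request body like the following for the Service Monitoring API's `serviceLevelObjectives` resource (field names follow the public API; the display name is made up for illustration):

```python
# Sketch: the tutorial's SLO expressed as a Service Monitoring API request
# body (projects.services.serviceLevelObjectives.create). Illustrative only.
import json

slo = {
    "displayName": "90% of requests under 10 ms (rolling day)",  # made-up name
    "goal": 0.90,                # compliance target
    "rollingPeriod": "86400s",   # 1-day rolling compliance period
    "serviceLevelIndicator": {
        "basicSli": {
            # Fraction of requests served faster than the threshold.
            "latency": {"threshold": "0.01s"}  # 10 ms
        }
    },
}
print(json.dumps(slo, indent=2))
```

You don't need the API for this tutorial; the console steps below create the same kind of object for you.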
In the Anthos Service Mesh Table view, click ledgerwriter to go to the service overview page.
Under Service status, click Create an SLO.
In the SLI Type list, select Latency.
Set Latency Threshold to an arbitrarily low value, such as 10 ms (significantly lower than the 99% latency value that you observed earlier).
In SLO Goal, set the Compliance target to 90%. Anthos Service Mesh uses this value to calculate the error budget for this SLO: the maximum percentage of requests that can exceed your specified latency threshold while the service still meets its SLO. With a 90% target, the error budget is 10% of requests.
In Compliance Period, set Period Type to Rolling and Period Length to 1 Day. The How your SLO would have performed panel appears, and the Name your SLO section suggests a default name for your new SLO.
To create the SLO and go to the Health page for the ledgerwriter service, click Submit.
Click the drop-down arrow to see more details about your SLO. You should see that the SLO is Out of Error Budget based on your settings. You can also edit or delete the SLO from this view.
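The Out of Error Budget status follows from simple arithmetic. Here is a minimal sketch using hypothetical request counts over the rolling one-day window; Anthos Service Mesh computes this from real telemetry, so this is just to show the math:

```python
# Error-budget arithmetic for a latency SLO (hypothetical numbers).
def error_budget_report(total_requests, slow_requests, compliance_target=0.90):
    """slow_requests = requests that exceeded the latency threshold."""
    budget_fraction = 1.0 - compliance_target      # e.g. 10% may be slow
    budget_requests = budget_fraction * total_requests
    consumed = slow_requests / budget_requests     # > 1.0 means out of budget
    return {
        "allowed_slow_requests": budget_requests,
        "budget_consumed": consumed,
        "out_of_budget": consumed > 1.0,
    }

# With a 10 ms threshold far below the observed 99% latency, nearly every
# request counts as "slow", so the budget is exhausted immediately:
report = error_budget_report(total_requests=10_000, slow_requests=9_900)
print(report["out_of_budget"])  # True: 9900 slow requests vs only ~1000 allowed
```

This is why the deliberately low threshold from the previous section puts the SLO out of budget right away.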
Rechecking SLO and alert indicators
On the service overview page, click the back arrow to return to the table view. Now you can see that the service count for No SLOs set has been reduced by one and that SLOs out of error budget is no longer 0.
If you scroll down to ledgerwriter, notice that the adjacent indicator has changed to an orange warning triangle. If you hold the pointer over that indicator, you're told to investigate service reliability. Clicking the indicator brings you back to the service's Health page to review your SLO details. The same indicator also appears for your service in the topology view.
Exploring the deployment further
There's still lots more to see and do in Anthos with our deployment. Feel free to try another tutorial or continue to explore the Anthos Sample Deployment on Google Cloud yourself, before following the cleanup instructions in the next section.
After you've finished exploring the Anthos Sample Deployment, you can clean up the resources that you created on Google Cloud so they don't take up quota and you aren't billed for them in the future. The following sections describe how to delete or turn off these resources.
Option 1. You can delete the project. This is the recommended approach. However, if you want to keep the project around, you can use Option 2 to delete the deployment.
Option 2. (Experimental) If you're working within an existing but empty project, you may prefer to manually revert all the steps from this tutorial, starting with deleting the deployment.
Option 3. (Experimental) If you're an expert on Google Cloud or have existing resources in your cluster, you may prefer to manually clean up the resources that you created in this tutorial.
Delete the project (option 1)
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the deployment (option 2)
This approach relies on allowing Deployment Manager to undo what it created. Even if the deployment had errors, you can use this approach to undo it.
In the Cloud Console, on the Navigation menu, click Deployment Manager.
Select your deployment, and then click Delete.
Confirm by clicking Delete again.
If clicking Delete doesn't work, as a last resort you can try Delete but preserve resources. If Deployment Manager is unable to delete some resources, note them and delete them manually later.
Wait for Deployment Manager to finish the deletion.
(Temporary step) On the Navigation menu, click Network services > Load balancing, and then delete the forwarding rules created by the deployment.
(Optional) Go to https://source.cloud.google.com/<project_id> and delete the repository whose name includes config-repo, if there is one.
(Optional) Delete the Service Account that you created during the deployment and all of its IAM roles.
Perform a manual cleanup (option 3)
This approach relies on manually deleting the resources from the Google Cloud Console.
In the Cloud Console, on the Navigation menu, click Kubernetes Engine.
Select your cluster and click Delete, and then click Delete again to confirm.
In the Cloud Console, on the Navigation menu, click Compute Engine.
Select the jump server and click Delete, and then click Delete again to confirm.
Complete the remaining cleanup steps from Option 2: delete the forwarding rules, and optionally the config-repo repository and the Service Account.
If you plan to redeploy after the manual cleanup, verify that all requirements are met as described in the Before you begin section.
There's lots more to explore in our Anthos documentation.
Try more tutorials
Explore Anthos security features with the Anthos Sample Deployment in Secure Anthos.
Try out other Google Cloud features for yourself. Have a look at our tutorials.
Learn more about Anthos
Learn more about Anthos in our technical overview.
Find out how to set up Anthos in a real production environment in our setup guide.
Find out how to do more with Anthos Service Mesh in the Anthos Service Mesh documentation.
Take our survey
When you finish working on this tutorial, please complete our survey. We're interested in hearing about any issues you might have at any point in the tutorial. Thanks for using the survey to submit your feedback.
The Anthos Team