Tutorial: Manage services with Anthos


Anthos Service Mesh provides Anthos users with tools to monitor and manage reliable microservice-based applications. This tutorial uses the Anthos Sample Deployment on Google Cloud to introduce you to some of Anthos Service Mesh's service management features by showing you how to define a service level objective (SLO). The Anthos Sample Deployment deploys a real Anthos hands-on environment with a GKE cluster, service mesh, and a Bank of Anthos application with multiple microservices.

What is an SLO?

According to Google's Site Reliability Engineering (SRE) book:

It's impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they use an internal API or a public product.

Google SRE teams use service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs) to structure and guide the metrics that inform their work. An SLI is a quantitative measure of some aspect of how your service is performing, such as its latency or availability, while an SLO is a target value ("this should happen x% of the time") for a service level that is measured by an SLI. Anthos Service Mesh makes it easy to define and refine SLOs for your own services. It gives you the information that you need to identify appropriate SLIs and SLOs, and notifies you when your service isn't meeting its SLOs.
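As a rough illustration of how an SLI and an SLO relate, the following minimal Python sketch computes a latency SLI (the fraction of requests served within a threshold) and checks it against an SLO target. The sample latencies, the 100 ms threshold, and the function names are hypothetical and are not part of Anthos Service Mesh, which performs this evaluation for you.

```python
# Hypothetical latency samples in milliseconds (not from the sample deployment).
latencies_ms = [12, 8, 40, 95, 7, 300, 15, 9, 22, 11]


def latency_sli(samples, threshold_ms):
    """Return the fraction of requests served within threshold_ms (the SLI)."""
    good = sum(1 for s in samples if s <= threshold_ms)
    return good / len(samples)


def meets_slo(sli, target):
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= target


sli = latency_sli(latencies_ms, threshold_ms=100)
print(f"SLI: {sli:.0%}, meets 90% SLO: {meets_slo(sli, 0.90)}")
```

Lowering the threshold (for example, to 10 ms) lowers the SLI for the same traffic, which is the effect this tutorial uses later to push an SLO out of its error budget.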

To find out more about SLOs and SLIs in Anthos Service Mesh, see the SLO overview and Designing SLOs.

Objectives

This tutorial introduces you to managing services with Anthos Service Mesh through the following tasks:

  • Identify a service level indicator (SLI) for a service.

  • Use a service level objective (SLO) to monitor for unexpected behavior.

Costs

Using the Anthos Sample Deployment will incur pay-as-you-go charges for Anthos on Google Cloud as listed on our Pricing page unless you have an Anthos subscription.

You are also responsible for other Google Cloud costs incurred while running the Anthos Sample Deployment, such as charges for Compute Engine VMs and load balancers. You can see an estimated monthly cost for all these resources on the deployment's Google Cloud Marketplace page.

We recommend cleaning up after finishing the tutorial or exploring the deployment to avoid incurring further charges. The Anthos Sample Deployment is not intended for production use and its components cannot be upgraded.

Before you begin

This tutorial is a follow-up to the Explore Anthos tutorial. Before starting this tutorial, follow the instructions on that page to set up your project and install the Anthos Sample Deployment.

Identifying SLIs

Anthos Service Mesh simplifies gathering SLIs and defining SLOs. In this example, you first define an SLO for the Bank of Anthos ledgerwriter service.

First use Anthos Service Mesh to find information that you could use to identify an SLI for the service.

  1. Go to the Anthos Service Mesh page in the project where you installed the Anthos Sample Deployment.

    Go to the Anthos Service Mesh page

    The top part of this view shows the current status of your application's services along with indicators for alerts and SLOs, including the count of services without SLOs; currently all of the services are under No SLOs set. In addition, in the Status column, all of the services have a black circle indicator. If you hold the pointer over that indicator for any service, you're informed that no SLO is set for the service.

  2. Note the value in ms for 99% latency for ledgerwriter (you may need to scroll down and across to see it). This metric means that 99% of requests complete within this time, so about one out of every 100 requests experiences at least this much delay. You will use this value in the next section.
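To make the 99% latency figure concrete, here is a minimal sketch of a nearest-rank percentile calculation in Python. The request data and the function name are hypothetical, and Anthos Service Mesh may compute its percentiles differently (for example, from histogram buckets), so treat this only as an illustration of what the number means.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical traffic: 200 requests, of which 197 are fast and 3 are slow.
latencies_ms = [20] * 197 + [450, 600, 900]

p99 = percentile_nearest_rank(latencies_ms, 99)
slower = sum(1 for s in latencies_ms if s > p99)
print(f"p99 = {p99} ms; {slower} of {len(latencies_ms)} requests were slower")
```

In this made-up data set, 198 of the 200 requests finish within the 99% latency value and only the slowest 1% take longer.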

Creating an SLO

Now create an SLO against a latency SLI for the service. To see what happens when a service exceeds its error budget, set a threshold that's deliberately low, based on the information that you saw in the previous section. For a real production service, you'd choose a latency threshold that is no stricter than your users need to have a good experience with your application.

  1. In the Anthos Service Mesh Table view, click ledgerwriter to go to the service overview page.

  2. Under Service status, click Create an SLO.

  3. In the SLI Type list, select Latency.

  4. Leave the default Request-based evaluation method, and click Continue.

  5. Set Latency Threshold to an arbitrarily low value, such as 10 ms (something significantly lower than the 99% latency value you observed earlier), and click Continue again.

  6. In Compliance Period, set Period Type to Rolling, and Period Length to 1 Day.

  7. In SLO Goal, set the Compliance target to 90%. Anthos Service Mesh uses this value to calculate the error budget that you have for this SLO; that is, the maximum percentage of requests that can exceed your specified latency threshold during the compliance period (10% for a 90% target). A Preview shows you how your SLO would have performed in the most recent one-day period. Click Continue.

  8. The Name your SLO section suggests a default name for your new SLO: you can accept the recommended default or specify a new name. To create the SLO and go to the Health page for the ledgerwriter service, click Create SLO.

Click the drop-down arrow to see more details about your SLO. You should see that the SLO is Out of Error Budget based on your settings. You can also edit or delete the SLO from this view.

Screenshot of Anthos Service Mesh service health view
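The error budget arithmetic behind the Out of Error Budget status can be sketched as follows. This is a simplified Python illustration for a request-based SLO over a single compliance period; the request counts and the function name are hypothetical, and this is not how Anthos Service Mesh is necessarily implemented internally.

```python
def error_budget(total_requests, slow_requests, target_pct):
    """Summarize error budget use for a request-based latency SLO.

    target_pct is the compliance target as a whole-number percentage,
    for example 90 for a 90% target.
    """
    # Requests allowed to exceed the latency threshold in this period.
    allowed_slow = total_requests * (100 - target_pct) / 100
    remaining = allowed_slow - slow_requests
    return {
        "allowed_slow": allowed_slow,
        "remaining": remaining,
        "out_of_budget": remaining < 0,
    }


# Hypothetical one-day window: 100,000 requests, 35,000 slower than the threshold.
report = error_budget(total_requests=100_000, slow_requests=35_000, target_pct=90)
print(report)
```

With a 90% target, only 10% of requests in the window may exceed the threshold. Because the threshold in this tutorial is set well below the observed latency, far more than 10% of requests are slow and the budget is exhausted.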

Rechecking SLO and alert indicators

  1. On the service overview page, click the back arrow to return to the table view. Now you can see that the service count for No SLOs set has been reduced by one and that SLOs out of error budget is no longer 0.

  2. If you scroll down to ledgerwriter, notice that the adjacent indicator has changed to an orange warning triangle. If you hold the pointer over that indicator, you're told to investigate service reliability. Clicking the indicator brings you back to the service's Health page to review your SLO details. The same indicator also appears for your service in the topology view.

Screenshot of Anthos Service Mesh service list with SLO warning

Exploring the deployment further

There's still lots more to see and do in Anthos with our deployment. Feel free to try another tutorial or continue to explore the Anthos Sample Deployment on Google Cloud yourself, before following the cleanup instructions in the next section.

Clean up

After you've finished exploring the Anthos Sample Deployment, you can clean up the resources that you created on Google Cloud so they don't take up quota and you aren't billed for them in the future. The following sections describe how to delete or turn off these resources.

  • Option 1. You can delete the project. This is the recommended approach. However, if you want to keep the project around, you can use Option 2 to delete the deployment.

  • Option 2. (Experimental) If you're working within an existing but empty project, you may prefer to manually revert all the steps from this tutorial, starting with deleting the deployment.

  • Option 3. (Experimental) If you're an expert on Google Cloud or have existing resources in your cluster, you may prefer to manually clean up the resources that you created in this tutorial.

Delete the project (option 1)

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the deployment (option 2)

This approach relies on allowing Deployment Manager to undo what it created. Even if the deployment had errors, you can use this approach to undo it.

  1. In the Google Cloud console, on the Navigation menu, click Deployment Manager.

  2. Select your deployment, and then click Delete.

  3. Confirm by clicking Delete again.

  4. Even if the deployment had errors, you can still select and delete it.

  5. If clicking Delete doesn't work, as a last resort you can try Delete but preserve resources. If Deployment Manager is unable to delete any resources, you need to note these resources and attempt to delete them manually later.

  6. Wait for Deployment Manager to finish the deletion.

  7. (Temporary step) On the Navigation menu, click Network services > Load balancing, and then delete the forwarding rules created by the anthos-sample-cluster1 cluster.

  8. (Optional) Go to https://source.cloud.google.com/<project_id>. Delete the repository whose name includes config-repo if there is one.

  9. (Optional) Delete the Service Account that you created during the deployment and all of its IAM roles.
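If you prefer the command line to the console, the core of Options 1 and 2 can be approximated with the gcloud CLI. The following Python sketch only builds and prints the commands (a dry run) so that you can review them before running anything. The project ID and deployment name shown are hypothetical placeholders, and the sketch assumes that the gcloud CLI is installed and authenticated. It does not cover the load balancer, repository, or service account steps.

```python
import shlex


def cleanup_commands(project_id, deployment_name=None):
    """Build gcloud commands for cleanup as argv lists, without running them.

    With no deployment_name, returns the Option 1 command (delete the project).
    With a deployment_name, returns the Option 2 command (delete the deployment).
    """
    if deployment_name is None:
        return [["gcloud", "projects", "delete", project_id]]
    return [[
        "gcloud", "deployment-manager", "deployments", "delete",
        deployment_name, "--project", project_id,
    ]]


# Placeholder values (assumptions); substitute your own project and deployment.
for argv in cleanup_commands("my-project-id", "my-anthos-deployment"):
    print(shlex.join(argv))
```

Both gcloud commands prompt for confirmation when run interactively, which mirrors the confirmation dialogs in the console steps.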

Perform a manual cleanup (option 3)

This approach relies on manually deleting the resources from the Google Cloud console.

  1. In the Google Cloud console, on the Navigation menu, click Kubernetes Engine.

  2. Select your cluster and click Delete, and then click Delete again to confirm.

  3. In the Google Cloud console, on the Navigation menu, click Compute Engine.

  4. Select the jump server and click Delete, and then click Delete again to confirm.

  5. Follow Steps 7 and 8 of Option 2.

If you plan to redeploy after the manual cleanup, verify that all requirements are met as described in the Before you begin section.

What's next

There's lots more to explore in our Anthos documentation.

Try more tutorials

  • Explore Anthos security features with the Anthos Sample Deployment in Secure Anthos.

  • Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.

Learn more about Anthos