This document introduces the structures used to represent services and SLOs in the SLO API and maps them to the concepts described generally in Concepts in service monitoring.
The SLO API is used to set up service-level objectives (SLOs) that can be used to monitor the health of your services.
Service Monitoring adds resources that represent services and service-level objectives to the Monitoring API.
For information on invoking the API, see Working with the API.
Services
A service is represented by a Service object. This object includes the following fields:
- A name: A fully qualified resource name for this service
- A display name: A label for use in console components
- A structure for one of the BasicService types
- A system-provided telemetry-configuration object
To define a basic service, you specify the type of service and provide a set of service-specific labels that describe the service:
{ "serviceType": string, "serviceLabels": { string: string, ... } }
The following sections provide examples for each type of service.
Basic service types
This section provides examples of service definitions built on the BasicService type, where the value of the serviceType field is one of the following:
APP_ENGINE
CLOUD_ENDPOINTS
CLUSTER_ISTIO
ISTIO_CANONICAL_SERVICE
CLOUD_RUN
Each of these service types uses the BasicSli service-level indicator.
App Engine
{ "displayName": "DISPLAY_NAME", "basicService": { "serviceType": "APP_ENGINE", "serviceLabels": { "module_id": "MODULE_ID" } } }
Cloud Endpoints
{ "displayName": "DISPLAY_NAME", "basicService": { "serviceType": "CLOUD_ENDPOINTS", "serviceLabels": { "service": "SERVICE" } } }
Cluster Istio
{ "displayName": "DISPLAY_NAME", "basicService": { "serviceType": "CLUSTER_ISTIO", "serviceLabels": { "location": "LOCATION", "cluster_name": "CLUSTER_NAME", "service_namespace": "SERVICE_NAMESPACE", "service_name": "SERVICE_NAME" } } }
Istio Canonical Service
{ "displayName": "DISPLAY_NAME", "basicService": { "serviceType": "ISTIO_CANONICAL_SERVICE", "serviceLabels": { "mesh_uid": "MESH_UID", "canonical_service_namespace": "CANONICAL_SERVICE_NAMESPACE", "canonical_service": "CANONICAL_SERVICE" } } }
Cloud Run
{ "displayName": "DISPLAY_NAME", "basicService": { "serviceType": "CLOUD_RUN", "serviceLabels": { "service_name": "SERVICE_NAME", "location": "LOCATION" } } }
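The five definitions above share one shape and differ only in their label keys. The following Python sketch builds such a body and checks the labels before you send it. The helper name `basic_service_body` and the `REQUIRED_LABELS` table are hypothetical, and the sketch assumes the label sets shown in the examples above are the complete required sets for each type.

```python
import json

# Label keys per serviceType, copied from the examples in this section.
# Assumption: these are exactly the keys each type requires.
REQUIRED_LABELS = {
    "APP_ENGINE": {"module_id"},
    "CLOUD_ENDPOINTS": {"service"},
    "CLUSTER_ISTIO": {
        "location", "cluster_name", "service_namespace", "service_name",
    },
    "ISTIO_CANONICAL_SERVICE": {
        "mesh_uid", "canonical_service_namespace", "canonical_service",
    },
    "CLOUD_RUN": {"service_name", "location"},
}


def basic_service_body(display_name, service_type, labels):
    """Build a Service body for a basic service type (hypothetical helper)."""
    required = REQUIRED_LABELS.get(service_type)
    if required is None:
        raise ValueError(f"unknown serviceType: {service_type}")
    if set(labels) != required:
        raise ValueError(f"{service_type} expects labels {sorted(required)}")
    return {
        "displayName": display_name,
        "basicService": {
            "serviceType": service_type,
            "serviceLabels": dict(labels),
        },
    }


body = basic_service_body(
    "checkout",
    "CLOUD_RUN",
    {"service_name": "checkout", "location": "us-central1"},
)
print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` rather than writing the body by hand also avoids syntax slips such as trailing commas.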
Basic GKE service types
This section contains examples of GKE service definitions built on the BasicService type, where the value of the serviceType field is one of the following:
GKE_NAMESPACE
GKE_WORKLOAD
GKE_SERVICE
You must define SLIs for these service types. They can't use BasicSli service-level indicators. For more information, see Service-level indicators.
GKE namespace
{ "displayName": "DISPLAY_NAME", "basicService": { "serviceType": "GKE_NAMESPACE", "serviceLabels": { "project_id": "PROJECT_ID", "location": "LOCATION", "cluster_name": "CLUSTER_NAME", "namespace_name": "NAMESPACE_NAME" } } }
GKE workload
{ "displayName": "DISPLAY_NAME", "basicService": { "serviceType": "GKE_WORKLOAD", "serviceLabels": { "project_id": "PROJECT_ID", "location": "LOCATION", "cluster_name": "CLUSTER_NAME", "namespace_name": "NAMESPACE_NAME", "top_level_controller_type": "TOPLEVEL_CONTROLLER_TYPE", "top_level_controller_name": "TOPLEVEL_CONTROLLER_NAME" } } }
GKE service
{ "displayName": "DISPLAY_NAME", "basicService": { "serviceType": "GKE_SERVICE", "serviceLabels": { "project_id": "PROJECT_ID", "location": "LOCATION", "cluster_name": "CLUSTER_NAME", "namespace_name": "NAMESPACE_NAME", "service_name": "SERVICE_NAME" } } }
Custom services
You can create custom services if none of the basic service types is suitable. A custom service looks like the following:
{ "displayName": "DISPLAY_NAME", "custom": {} }
You must define SLIs for these service types. They can't use BasicSli service-level indicators. For more information, see Service-level indicators.
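The rule about which services can use a BasicSli can be captured in a small check. This is a sketch only: the function name `supports_basic_sli` is hypothetical, and the set of types is taken from the lists in this document rather than from any API call.

```python
# Service types that this document says use the BasicSli indicator.
BASIC_SLI_TYPES = {
    "APP_ENGINE",
    "CLOUD_ENDPOINTS",
    "CLUSTER_ISTIO",
    "ISTIO_CANONICAL_SERVICE",
    "CLOUD_RUN",
}


def supports_basic_sli(service):
    """Return True if the service body can use a BasicSli (hypothetical check)."""
    basic = service.get("basicService")
    if basic is None:
        # Custom services carry a "custom" field instead of "basicService".
        return False
    # GKE service types are BasicService types but are not in the set above.
    return basic.get("serviceType") in BASIC_SLI_TYPES


custom = {"displayName": "my-custom", "custom": {}}
gke = {"displayName": "my-gke", "basicService": {"serviceType": "GKE_SERVICE"}}
run = {"displayName": "my-run", "basicService": {"serviceType": "CLOUD_RUN"}}

print(supports_basic_sli(custom), supports_basic_sli(gke), supports_basic_sli(run))
```

For services where the check returns False, you define a request-based or windows-based SLI yourself.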
Service-level indicators
A service-level indicator (SLI) provides a measure of the performance of a service. An SLI is based on a metric captured by the service. Exactly how the SLI is defined depends on the type of metric used as the indicator metric, but it is generally some comparison between acceptable results and total results.
An SLI is represented by the ServiceLevelIndicator object. This object is a collective way to refer to the three supported types of SLIs:
- A basic SLI, which is created automatically for instances of the BasicService service type. This type of SLI is described in Service-level objectives; it is represented by a BasicSli object and measures availability or latency.
- A request-based SLI, which you can use to count events that represent acceptable service. Use of this type of SLI is described in Request-based SLOs; it is represented by a RequestBasedSli object.
- A windows-based SLI, which you can use to count periods of time that meet some goodness criterion. Use of this type of SLI is described in Windows-based SLOs; it is represented by a WindowsBasedSli object.
For example, the following shows a basic availability SLI:
{ "basicSli": { "availability": {}, "location": [ "us-central1-c" ] } }
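Because a ServiceLevelIndicator wraps one of three SLI types, code that reads an SLI usually starts by working out which one is present. The following sketch does that by looking at the top-level key. The function name `sli_kind` is hypothetical, and the sketch assumes that exactly one of the keys `basicSli`, `requestBased`, or `windowsBased` appears, as in the examples in this document.

```python
# Maps the JSON key used in this document's examples to the object type.
SLI_KINDS = {
    "basicSli": "BasicSli",
    "requestBased": "RequestBasedSli",
    "windowsBased": "WindowsBasedSli",
}


def sli_kind(sli):
    """Return the SLI object type name for a ServiceLevelIndicator dict."""
    present = [key for key in SLI_KINDS if key in sli]
    if len(present) != 1:
        # Assumption: a valid indicator sets exactly one of the three keys.
        raise ValueError(f"expected exactly one SLI key, found {present}")
    return SLI_KINDS[present[0]]


basic = {"basicSli": {"availability": {}, "location": ["us-central1-c"]}}
print(sli_kind(basic))
```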
Structures for request-based SLIs
A request-based SLI is based on a metric that counts units of service as a ratio between a particular outcome and the total. For example, if you use a metric that counts requests, you can build the ratio between the number of requests that return success and the total number of requests.
There are two ways to build a request-based SLI:
- As a TimeSeriesRatio, when the ratio of good service to total service is computed from two time series whose values have a metric kind of DELTA or CUMULATIVE.
- As a DistributionCut, when the time series has a value type of DISTRIBUTION and its values have a metric kind of DELTA or CUMULATIVE. The good-service value is the count of items that fall into the histogram buckets in a specified range, and the total is the count of all values in the distribution.
The following shows the JSON representation of an SLI that uses a time-series ratio:
{ "requestBased": { "goodTotalRatio": { "totalServiceFilter": "resource.type=https_lb_rule metric.type=\"loadbalancing.googleapis.com/https/request_count\"", "goodServiceFilter": "resource.type=https_lb_rule metric.type=\"loadbalancing.googleapis.com/https/request_count\" metric.label.response_code_class=200" } } }
The time series in this ratio are identified by the pair of monitored-resource type and metric type:
- Resource: https_lb_rule
- Metric type: loadbalancing.googleapis.com/https/request_count
The value for the totalServiceFilter is represented by the pair of metric and resource type. The value for the goodServiceFilter is represented by the same pair but where some label has a particular value; in this case, when the value of the response_code_class label is 200.
The ratio between the filters measures the number of requests that return a 2xx HTTP status over the total number of requests.
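Filter strings contain quoted metric types, so the quotes must be escaped when the filter is embedded in JSON. One way to avoid hand-escaping is to build the structure in code and let a JSON library serialize it. The following sketch assumes the resource and metric types from the example above; the helper name `lb_filter` is hypothetical.

```python
import json

METRIC = "loadbalancing.googleapis.com/https/request_count"


def lb_filter(metric_type, extra=""):
    """Build a filter for the https_lb_rule resource (hypothetical helper)."""
    base = f'resource.type=https_lb_rule metric.type="{metric_type}"'
    # The good-service filter is the total filter plus a label restriction.
    return f"{base} {extra}".strip()


sli = {
    "requestBased": {
        "goodTotalRatio": {
            "totalServiceFilter": lb_filter(METRIC),
            "goodServiceFilter": lb_filter(
                METRIC, "metric.label.response_code_class=200"
            ),
        }
    }
}

# json.dumps escapes the inner double quotes as \" automatically.
encoded = json.dumps(sli)
print(encoded)
```

Building the good-service filter from the total-service filter keeps the two in sync, so the ratio always compares a subset against its own total.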
The following shows the JSON representation of an SLI that uses a distribution cut:
{ "requestBased": { "distributionCut": { "distributionFilter": "resource.type=https_lb_rule metric.type=\"loadbalancing.googleapis.com/https/backend_latencies\" metric.label.response_code_class=200", "range": { "min": "-Infinity", "max": 500.0 } } } }
The time series is identified by the monitored-resource type, metric type, and value for a metric label:
- Resource: https_lb_rule
- Metric type: loadbalancing.googleapis.com/https/backend_latencies
- Label-value pair: response_code_class=200
The range of latencies considered good is designated by the range field. This SLI computes the ratio of 2xx-class responses with latencies below 500 to all 2xx-class responses.
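To make the arithmetic of a distribution cut concrete, the following sketch computes the good-to-total ratio from a small histogram. It is a simplification under stated assumptions: buckets are modeled as hypothetical `(lower, upper, count)` tuples, the sample counts are invented for illustration, and a bucket counts as good only when it lies entirely inside the range. How the service treats buckets that straddle the range boundary is not covered here.

```python
import math


def distribution_cut_ratio(buckets, range_min=-math.inf, range_max=math.inf):
    """Return good/total for a histogram, or None if it is empty.

    buckets: iterable of (lower_bound, upper_bound, count) tuples.
    """
    buckets = list(buckets)
    total = sum(count for _, _, count in buckets)
    if total == 0:
        return None
    # Simplifying assumption: only buckets fully inside the range are good.
    good = sum(
        count
        for lower, upper, count in buckets
        if lower >= range_min and upper <= range_max
    )
    return good / total


# Illustrative latency histogram: 90 of 100 values fall below 500.
latency_buckets = [
    (0, 100, 40),
    (100, 500, 50),
    (500, 1000, 8),
    (1000, math.inf, 2),
]
ratio = distribution_cut_ratio(latency_buckets, range_max=500.0)
print(ratio)
```

Leaving `range_min` at negative infinity mirrors the "-Infinity" lower bound in the JSON example.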
Structures for windows-based SLIs
A windows-based SLI counts time windows in which the provided service is considered good. The criterion for determining good service is part of the SLI definition.
All windows-based SLIs include a window period of 60 to 86,400 seconds (1 day).
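A window period can be checked against these bounds before it is sent. The sketch below parses a duration string of the form "300s", the form used for windowPeriod in this document's JSON example. The function name `parse_window_period` is hypothetical, and the sketch assumes whole seconds only.

```python
MIN_WINDOW_SECONDS = 60
MAX_WINDOW_SECONDS = 86_400  # 1 day


def parse_window_period(value):
    """Parse a "<seconds>s" string and check it against the allowed range."""
    if not value.endswith("s"):
        raise ValueError(f"expected a value like '300s', got {value!r}")
    # Assumption: whole seconds; int() raises ValueError for anything else.
    seconds = int(value[:-1])
    if not MIN_WINDOW_SECONDS <= seconds <= MAX_WINDOW_SECONDS:
        raise ValueError(
            f"window period must be {MIN_WINDOW_SECONDS} to "
            f"{MAX_WINDOW_SECONDS} seconds, got {seconds}"
        )
    return seconds


print(parse_window_period("300s"))
```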
There are two ways to specify the good-service criterion for a windows-based SLI:
- Create a filter string, described in Monitoring filters, that returns a time series with boolean values. A window is good if the value for that window is true. This filter is called the goodBadMetricFilter.
- Create a PerformanceThreshold object that represents a threshold for acceptable performance. This object is specified as the value of the goodTotalRatioThreshold.

  A PerformanceThreshold object specifies a threshold value and a performance SLI. If the value of the performance SLI meets or exceeds the threshold value, then the time window counts as good.

  There are two ways to specify the performance SLI:
  - As a BasicSli object in the basicSliPerformance field.
  - As a RequestBasedSli object in the performance field. If you use a request-based SLI, then the metric kind of your SLI must be DELTA or CUMULATIVE. You can't use GAUGE metrics in request-based SLIs.
The following shows the JSON representation of a windows-based SLI built on a performance threshold for a basic availability SLI:
{ "windowsBased": { "goodTotalRatioThreshold": { "threshold": 0.9, "basicSliPerformance": { "availability": {}, "location": [ "us-central1-c" ] } }, "windowPeriod": "300s" } }
This SLI specifies good performance as a 5-minute window in which availability reaches 90% or better. The structure of a basic SLI is shown in Service-level indicators.
You can also embed a request-based SLI in the windows-based SLI. For more information on the embedded structures, see Structures for request-based SLIs.
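The way a performance threshold turns per-window measurements into a windows-based SLI value can be sketched as follows. The function name `good_window_fraction` and the sample availability values are invented for illustration; the only rule taken from this document is that a window is good when its performance meets or exceeds the threshold.

```python
def good_window_fraction(window_performance, threshold):
    """Return the fraction of windows whose performance meets the threshold.

    window_performance: one performance value (0.0 to 1.0) per time window.
    Returns None when there are no windows to evaluate.
    """
    values = list(window_performance)
    if not values:
        return None
    # A window is good when its value meets or exceeds the threshold.
    good = sum(1 for value in values if value >= threshold)
    return good / len(values)


# Illustrative availability for six 5-minute windows.
availability_per_window = [0.95, 0.99, 0.85, 0.90, 1.0, 0.70]
fraction = good_window_fraction(availability_per_window, threshold=0.9)
print(fraction)
```

With a threshold of 0.9, four of the six sample windows count as good, including the one that sits exactly at 0.90.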
Service-level objectives
A service-level objective (SLO) is represented by a ServiceLevelObjective object. This object includes the following fields:
- A name
- A display name
- The target SLI; an embedded ServiceLevelIndicator object
- The performance goal for the SLI
- The compliance period for the SLI
The following shows the JSON representation of an SLO that uses a basic availability SLI as the value of the serviceLevelIndicator field:
{ "name": "projects/PROJECT_NUMBER/services/PROJECT_ID-zone-us-central1-c-csm-main-default-currencyservice/serviceLevelObjectives/3kavNVTtTMuzL7KcXAxqCQ", "serviceLevelIndicator": { "basicSli": { "availability": {}, "location": [ "us-central1-c" ] } }, "goal": 0.98, "calendarPeriod": "WEEK", "displayName": "98% Availability in Calendar Week" }
This SLO sets the performance goal at 98 percent availability over a calendar week.
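As a rough illustration of what the goal means, the following sketch compares a measured good-to-total ratio for one compliance period against the goal field of an SLO. The function name `meets_goal` and the request counts are hypothetical, and the SLO dict repeats only the fields needed for the comparison.

```python
slo = {
    "serviceLevelIndicator": {
        "basicSli": {"availability": {}, "location": ["us-central1-c"]}
    },
    "goal": 0.98,
    "calendarPeriod": "WEEK",
    "displayName": "98% Availability in Calendar Week",
}


def meets_goal(slo, good, total):
    """Return True if good/total reaches the SLO goal, None if total is 0."""
    if total == 0:
        return None  # nothing was measured in this period
    return good / total >= slo["goal"]


# Illustrative counts for one calendar week.
print(meets_goal(slo, good=9_850, total=10_000))  # 98.5% measured
print(meets_goal(slo, good=9_700, total=10_000))  # 97.0% measured
```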