The Google Cloud data services discussed on this page include those that process provided data and output the results of that processing, either in response to a request or continuously. Rather than using availability and latency as the primary SLIs for these services, more appropriate choices are the following:
- Correctness, a measure of how many processing errors the pipeline incurs.
- Freshness, a measure of how quickly data is processed.
For further information on data pipelines from the SRE perspective, see Data Processing Pipelines in the Site Reliability Engineering Workbook.
You express a request-based correctness SLI by using the TimeSeriesRatio structure to set up a ratio of items that had processing problems to all items processed. You decide how to filter the metric by using its available labels to arrive at your preferred determination of "problem" and "valid" totals.
You express a request-based freshness SLI by using a DistributionCut structure.
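These SLI structures are embedded in SLO definitions that you create against a service in Cloud Monitoring. As a minimal sketch, assuming a project PROJECT_ID, an existing service SERVICE_ID, and a complete SLO definition (such as the freshness examples later on this page) saved to a local file named slo.json, you could create the SLO by calling the Service Monitoring API directly:

# Minimal sketch: create an SLO from a local JSON definition by using the
# Service Monitoring API. PROJECT_ID, SERVICE_ID, and slo.json are
# placeholders; substitute your own project, service, and definition file.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @slo.json \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives"

The API returns the created SLO, including its server-assigned name, which you can use to retrieve or update the objective later.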
Dataflow
Dataflow is a fully managed streaming analytics service that minimizes latency, processing time, and cost. You can use Dataflow to process data as a stream or in batches using the Apache Beam SDK.
For additional information, see the following:
- Documentation for Dataflow.
- List of dataflow.googleapis.com metric types.
Correctness SLIs
Dataflow writes metric data to Cloud Monitoring using the dataflow_job monitored-resource type and the job/element_count metric type, which counts the number of elements added to the PCollection so far. Summing across the job_name resource label gives you the number of elements to be processed by the job.
Separately, you can use the logging.googleapis.com/log_entry_count metric type with the dataflow_job monitored-resource type to count the number of errors logged by a particular job, by filtering on the severity metric label.
You can use these metrics to express a request-based correctness SLI as a ratio of logged errors to all processed elements by using a TimeSeriesRatio structure, as shown in the following example:
"serviceLevelIndicator": {
"requestBased": {
"goodTotalRatio": {
"totalServiceFilter":
"metric.type=\"dataflow.googleapis.com/job/element_count\"
resource.type=\"dataflow_job\"
resource.label.\"job_name\"=\"my_job\"",
"badServiceFilter":
"metric.type=\"logging.googleapis.com/log_entry_count\"
resource.type=\"dataflow_job\"
resource.label.\"job_name\"=\"my_job\"
metric.label.\"severity\"=\"error\"",
}
}
}
Freshness SLIs
Dataflow also writes metric data to Cloud Monitoring using the dataflow_job monitored-resource type and the job/per_stage_system_lag metric type, which measures the current maximum duration that an item of data has been processing or awaiting processing.
You can express a freshness SLI using this metric by using a DistributionCut structure.
The following example SLO expects that the oldest data element is processed in under 100 seconds 99% of the time over a rolling one-hour period:
{
  "serviceLevelIndicator": {
    "requestBased": {
      "distributionCut": {
        "distributionFilter":
          "metric.type=\"dataflow.googleapis.com/job/per_stage_system_lag\"
           resource.type=\"dataflow_job\"
           resource.label.\"job_name\"=\"my_job\"",
        "range": {
          "min": 0,
          "max": 100
        }
      }
    }
  },
  "goal": 0.99,
  "rollingPeriod": "3600s",
  "displayName": "99% data elements processed under 100 s"
}
You can also express a freshness SLI by using a WindowsBasedSli structure.
The following example SLO expects that, over a rolling one-day period, 99% of five-minute windows see no element take longer than 100 seconds to process:
{
  "displayName": "Dataflow - windowed freshness",
  "serviceLevelIndicator": {
    "windowsBased": {
      "windowPeriod": "300s",
      "metricMeanInRange": {
        "timeSeries":
          "metric.type=\"dataflow.googleapis.com/job/per_stage_system_lag\"
           resource.type=\"dataflow_job\"
           resource.label.\"job_name\"=\"my_job\"",
        "range": {
          "min": 0,
          "max": 100
        }
      }
    }
  },
  "goal": 0.99,
  "rollingPeriod": "86400s"
}
Note that, for a window to be considered "good", the metric cannot exceed the threshold specified in range at any point during the evaluation window.
Dataproc
Dataproc provides a fully managed, purpose-built cluster that can automatically scale to support any Hadoop or Spark data- or analytics-processing job.
For additional information, see the following:
- Documentation for Dataproc.
- List of dataproc.googleapis.com metric types.
Correctness SLIs
Dataproc writes metric data to Cloud Monitoring using the cloud_dataproc_cluster monitored-resource type and the following metric types:
- cluster/job/submitted_count, which counts the total number of jobs submitted.
- cluster/job/failed_count, which counts the total number of failed jobs.
You can use these metrics to express a request-based correctness SLI as a TimeSeriesRatio of failed jobs to all submitted jobs, as shown in the following example:
"serviceLevelIndicator": {
"requestBased": {
"goodTotalRatio": {
"totalServiceFilter":
"metric.type=\"dataproc.googleapis.com/cluster/job/submitted_count\"
resource.type=\"cloud_dataproc_cluster\"
resource.label.\"cluster_name\"=\"my_cluster\"",
"badServiceFilter":
"metric.type=\"dataproc.googleapis.com/cluster/job/failed_count\"
resource.type=\"cloud_dataproc_cluster\"
resource.label.\"cluster_name\"=\"my_cluster\"",
}
}
}
Freshness SLIs
Dataproc also writes metric data to Cloud Monitoring using the cloud_dataproc_cluster monitored-resource type and the following metric types:
- cluster/job/duration, which measures how long jobs remain in processing states. You can filter the data on the state metric label to identify time spent in specific states. For example, you can create an SLI that measures how long jobs remain in the PENDING state, to set a maximum allowable wait time before a job starts processing.
- cluster/job/completion_time, which measures how long jobs take to complete. Use this metric when job completion time is well understood or when the volume of data processed by jobs in a cluster doesn't vary, which would affect processing time.
You can express a freshness SLI using these metrics by using a DistributionCut structure, as shown in the following examples.
The following example SLO uses cluster/job/duration and expects that 99% of jobs in "my_cluster" are in the PENDING state for under 100 seconds over a rolling 24-hour period:
{
  "serviceLevelIndicator": {
    "requestBased": {
      "distributionCut": {
        "distributionFilter":
          "metric.type=\"dataproc.googleapis.com/cluster/job/duration\"
           resource.type=\"cloud_dataproc_cluster\"
           resource.label.\"cluster_name\"=\"my_cluster\"
           metric.label.\"state\"=\"PENDING\"",
        "range": {
          "min": 0,
          "max": 100
        }
      }
    }
  },
  "goal": 0.99,
  "rollingPeriod": "86400s",
  "displayName": "Dataproc pending jobs"
}
The following example SLO uses cluster/job/completion_time and expects that 99% of jobs in "my_cluster" complete in under 100 seconds over a rolling 24-hour period:
{
  "serviceLevelIndicator": {
    "requestBased": {
      "distributionCut": {
        "distributionFilter":
          "metric.type=\"dataproc.googleapis.com/cluster/job/completion_time\"
           resource.type=\"cloud_dataproc_cluster\"
           resource.label.\"cluster_name\"=\"my_cluster\"",
        "range": {
          "min": 0,
          "max": 100
        }
      }
    }
  },
  "goal": 0.99,
  "rollingPeriod": "86400s",
  "displayName": "Dataproc completed jobs"
}