Trace sampling

This document introduces the concept of sampling, which refers to whether data for a span is sent to Cloud Trace. When data for a span is sent to Cloud Trace, then that span is sampled. When data for every span in a trace is recorded, the trace is complete. However, traces frequently have missing spans because each instrumented component in a distributed tracing system independently decides whether or not to record the span it is processing.

Although each component makes its own decision as to whether the span it is processing is sampled, that decision can be influenced by the parent's sampling decision. For example, assume every component has a rule that says "if the parent span is sampled, then sample the current span; otherwise, sample 50% of the spans". In this scenario, the following is true:

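  • The sampling decision for the root span determines whether all spans in the trace are sampled.
  • When the root span is sampled, all spans in the trace are sampled. Therefore, the trace is complete.
  • When the root span isn't sampled, each child span falls back to the 50% rule. Some spans might be sampled, but the trace is incomplete because the root span is missing.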
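The example rule can be sketched in a few lines of Python. This is only an illustration of the rule described previously, not how any particular tracing library implements it. The function names, the `fallback_rate` parameter, and the linear chain of spans are assumptions made for the example.

```python
import random


def should_sample(parent_sampled, fallback_rate=0.5, rng=random.random):
    """Apply the example rule to one span.

    parent_sampled is True or False for a child span, or None for a root
    span, which has no parent. (Hypothetical signature for illustration.)
    """
    if parent_sampled:
        # The parent was sampled, so the current span is sampled too.
        return True
    # No parent, or the parent wasn't sampled: sample a fraction of spans.
    return rng() < fallback_rate


def sample_trace(depth, rng=random.random):
    """Simulate a linear chain of `depth` spans and return each decision."""
    decisions = []
    parent = None  # The root span has no parent.
    for _ in range(depth):
        decision = should_sample(parent, rng=rng)
        decisions.append(decision)
        parent = decision  # The next span sees this decision as its parent's.
    return decisions


if __name__ == "__main__":
    print(sample_trace(5))
```

In this sketch, after any span in the chain is sampled, every span below it is also sampled. The trace is complete only when that first sampled span is the root.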

Components can pass their sampling decision to the child by using context. For example, in the World Wide Web Consortium (W3C) traceparent header, the sampled flag stores the parent's sampling decision.
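To make the sampled flag concrete, the following Python sketch reads it from a traceparent header value. It assumes the version 00 layout of the header (version, trace ID, parent ID, and trace flags, separated by hyphens) and doesn't perform every validation that the W3C specification requires. The function name and the returned dictionary are hypothetical.

```python
import re

# Version 00 layout: 2 hex digits, 32 hex digits, 16 hex digits, 2 hex digits.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)

SAMPLED_FLAG = 0x01  # Least significant bit of the trace-flags field.


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it doesn't parse."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        # The parent's sampling decision is carried in the sampled bit.
        "sampled": bool(int(flags, 16) & SAMPLED_FLAG),
    }


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(parse_traceparent(header))
```

A child component could pass the `sampled` value to its own sampler as the parent's decision.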

Don't confuse sampling with context propagation. Sampling refers to whether a component records data about a span. Context propagation refers to whether information about the span, such as the span ID, is passed to child components.

Sampling strategies

Sampling decisions can be either head-based or tail-based. In head-based sampling, the sampling decision is made when the request is received by the component processing the span. In tail-based sampling, the sampling decision is delayed until after the entire trace is available.

You might encounter the phrase "100% sampling" in documentation for distributed tracing systems. This phrase might apply to a trace or to a component. When applied to a trace, it means that all spans have been sampled, or equivalently, that the trace is complete. When applied to a component, it means that the component samples every span it processes.

Head-based sampling

Head-based samplers are typically configured to always sample spans or to use a probabilistic sampling strategy:

  • With always-sample configurations, every component that services spans and can write trace data samples the spans it processes. Ideally, all traces are complete, so you have the information necessary to troubleshoot failures. However, this type of configuration might cause you to exceed quotas or your storage cost limits.

  • With probabilistic sampling, not all spans are sampled. The actual behavior for this approach depends on the component's implementation. In some implementations, all spans have the same probability of being sampled. In others, the sampling decision of the parent influences whether a span is sampled.
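As an illustration of a probabilistic sampler where every span has the same probability of being sampled, the following Python sketch derives the decision from the trace ID. This is one possible approach and an assumption made for the example, not a description of how any particular component behaves. The function name and the use of the low 64 bits of the trace ID are hypothetical choices.

```python
def trace_id_ratio_sample(trace_id_hex, rate):
    """Decide whether to sample, using the trace ID as the random source.

    trace_id_hex is a 32-character hexadecimal trace ID and rate is the
    fraction of traces to sample, from 0.0 to 1.0.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # Treat the low 64 bits of the trace ID as a uniformly distributed number.
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    # Sample when that number falls below the requested fraction of the range.
    return low_bits < rate * (1 << 64)


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(trace_id_ratio_sample(trace_id, 0.5))
```

Because the decision depends only on the trace ID and the rate, components that use the same rate reach the same decision for every span in a trace, provided that trace IDs are randomly generated.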

Traces might not contain every span. Missing spans might be expected because of probabilistic sampling, or they might result from quota limits or from components that process a request but don't sample the span.

Tail-based sampling

Cloud Trace doesn't support tail-based sampling; sampling decisions must be made in the components that send data to Cloud Trace.

If you want to use tail-based sampling, then you can use an intermediary server that receives tracing information, makes a sampling decision, and then relays the sampled data to Cloud Trace. For example, you can use the OpenTelemetry Collector with the Tail Sampling Processor to make a delayed sampling decision.

If you plan to use tail-based sampling, consider the following:

  • You must store all spans in a trace before you make a sampling decision. Therefore, you might require a large amount of temporary storage or incur other overhead.
  • In general, all components that can generate spans for a trace need to coordinate. Typically, developers who use OpenTelemetry route all spans for the same trace ID to the same collector.
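The two considerations can be sketched in Python as follows. This is a simplified illustration, not the behavior of the OpenTelemetry Collector or its Tail Sampling Processor. The `TailSampler` class, the `collector_for` and `has_error` functions, and the dictionary representation of a span are hypothetical. The sketch also assumes that the caller knows when a trace is finished, for example after a timeout.

```python
import zlib
from collections import defaultdict


def collector_for(trace_id, num_collectors):
    """Pick a collector index so all spans of a trace reach the same one."""
    return zlib.crc32(trace_id.encode("utf-8")) % num_collectors


class TailSampler:
    """Buffers spans per trace and decides after the trace is finished."""

    def __init__(self, keep_trace):
        self._keep_trace = keep_trace     # Policy: list of spans -> bool.
        self._buffer = defaultdict(list)  # trace_id -> buffered spans.

    def add_span(self, span):
        # Every span is stored until the decision is made, which is why
        # tail-based sampling can need a lot of temporary storage.
        self._buffer[span["trace_id"]].append(span)

    def finish_trace(self, trace_id):
        """Return the spans to export, which might be none, and free them."""
        spans = self._buffer.pop(trace_id, [])
        return spans if self._keep_trace(spans) else []


def has_error(spans):
    """Example policy: keep a trace only if at least one span failed."""
    return any(span.get("error") for span in spans)


if __name__ == "__main__":
    sampler = TailSampler(has_error)
    sampler.add_span({"trace_id": "a", "name": "root"})
    sampler.add_span({"trace_id": "a", "name": "db", "error": True})
    print(len(sampler.finish_trace("a")))
```

Because the policy sees the whole trace, it can keep traces based on information that isn't available when the request is received, such as whether any span reported an error.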

Sampling and Google Cloud services

Each Google Cloud service makes its own sampling decisions, and not all Google Cloud services sample. That is, a service might never send data to Cloud Trace.

When sampling is supported by a Google Cloud service, that service typically implements the following:

  • A default sampling rate.
  • A mechanism to use the parent's sampling decision as a hint as to whether to sample the span.
  • A maximum sampling rate.

To request that a Google Cloud service add support for sampling, use the Google Issue Tracker.

What's next