Stream logs from Google Cloud to Datadog

Last reviewed 2024-12-10 UTC

This document describes a reference architecture for sending log event data from across your Google Cloud ecosystem to Datadog Log Management. The architecture uses a Pub/Sub pull subscription and a Dataflow pipeline to deliver the data from Google Cloud to Datadog. This reference architecture is intended for IT professionals who want to stream logs from Google Cloud to Datadog. This document assumes that you are familiar with Datadog Log Management.

Sending your logs to Datadog lets you visualize logs, set up alerts for log events, and correlate logs with metrics and traces from across your stack. Datadog is a cloud-based platform that provides methods to monitor and secure your infrastructure and applications. Datadog Log Management unifies logs, metrics, and traces into a single dashboard. Having a single view helps provide rich context when you analyze your log data.

Architecture

The following diagram shows the architecture that's described in this document. The diagram demonstrates how log files that are generated by Google Cloud are ingested by Datadog and shown to Datadog users.

Log file ingestion from Google Cloud to Datadog Log Management.

This reference architecture uses Pub/Sub and Dataflow to forward your log files to Datadog. The architecture achieves a high level of log-file throughput by allowing batch delivery and compression. If you generate high-throughput and low-latency event logs, we recommend that you use a pull subscription.

For more information about the different features that are supported by Pub/Sub subscriptions, see the Pub/Sub subscription comparison table.

This architecture diagram includes the following components:

  • Cloud Logging: Routes all of your logs to a Cloud Logging sink, where they can then be filtered and forwarded to supported destinations, like Pub/Sub.
  • Pub/Sub: Forwards your log file data to Datadog through a Dataflow pipeline. When Pub/Sub is integrated with Cloud Logging, Pub/Sub uses topics and pull subscriptions to publish log file data to Pub/Sub topics in near real-time. For more information, see View logs routed to Pub/Sub.
  • Dataflow: Offers two pipeline types to manage Google Cloud log files:
    • Log forwarding: This is the primary pipeline. Dataflow workers batch log file data and then compress it. The pipeline then sends that data to Datadog.
    • Dead-letter pipeline: This is the backup pipeline. When there are data processing errors, Dataflow workers send the log messages to the dead-letter topic. When you've resolved the errors manually, you create this pipeline to resend the data in the dead-letter topic to Datadog.
  • Datadog: Datadog's Log Management system is the destination for your Google Cloud log file data. Each Datadog site provides a unique SSL-encrypted logging endpoint. For more information about the HTTPS endpoint of your Datadog data center, see logging endpoints.
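As a concrete illustration of what flows through this pipeline, the following sketch decodes a pulled Pub/Sub message whose `data` field carries a Cloud Logging LogEntry as base64-encoded JSON. The helper name and the sample field values are hypothetical; the field names follow the LogEntry format.

```javascript
// Decode a Pub/Sub message whose data field carries a Cloud Logging
// LogEntry as base64-encoded JSON. The sample values are hypothetical.
function decodeLogEntry(pubsubMessage) {
  const json = Buffer.from(pubsubMessage.data, 'base64').toString('utf8');
  const entry = JSON.parse(json);
  return {
    severity: entry.severity,
    resourceType: entry.resource && entry.resource.type,
    payload: entry.textPayload || entry.jsonPayload,
  };
}

// Example message, shaped like a message pulled from a subscription.
const sample = {
  data: Buffer.from(JSON.stringify({
    severity: 'ERROR',
    resource: { type: 'gce_instance' },
    textPayload: 'disk almost full',
  })).toString('base64'),
  messageId: '123',
};

console.log(decodeLogEntry(sample));
// { severity: 'ERROR', resourceType: 'gce_instance', payload: 'disk almost full' }
```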

Products used

This reference architecture uses the following Google Cloud and third-party products:

  • Cloud Logging: A real-time log management system with storage, search, analysis, and alerting.
  • Pub/Sub: An asynchronous and scalable messaging service that decouples services that produce messages from services that process those messages.
  • Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.
  • Dataflow: A service that provides unified stream and batch data processing at scale.
  • Datadog Log Management: A service to collect, process, archive, explore, and monitor all of your logs.

Use case

Use a pre-built Dataflow template to forward your Google Cloud logs to Datadog. The pre-built template minimizes network overhead by batching and compressing the log files before they're transferred.

If your organization uses a Virtual Private Cloud (VPC) with service perimeters, you need a pull-based subscription model to access endpoints that are outside the VPC perimeter. Pull subscriptions are useful if your log volume is highly variable. Use the log data that you transferred to Datadog for your organization's dashboards, alerts, and security platforms. You can also use the log data to troubleshoot.

The log forwarding architecture that's described on this page uses a pull subscription with the Pub/Sub topic. The pull subscription enables access to the external Datadog endpoint. However, when you're using service perimeters, not all of the subscription delivery types provide access to external endpoints.

For more information, see Supported products and limitations.

Design considerations

This section describes design factors, best practices, and design recommendations that you should consider when you use this reference architecture to develop a topology that meets your specific requirements for security, reliability, operational efficiency, cost, and performance.

The guidance in this section isn't exhaustive. Depending on the specific requirements of your application and the Google Cloud and third-party products and features that you use, there might be additional design factors and trade-offs that you should consider.

Security, privacy, and compliance

This section describes factors that you should consider when you design and build a log-file-delivery pipeline between Google Cloud and Datadog that meets your security, privacy, and compliance requirements.

Use private networking for the Dataflow VMs

We recommend that you restrict access to the worker VMs that are used in the Dataflow pipeline by configuring them with private IP addresses. Restricting access lets you keep communication on private networks and away from the public internet.

To restrict access while also allowing the VMs to stream the exported logs into Datadog over HTTPS, configure a public Cloud NAT gateway. After you configure the gateway, map the gateway to the subnet that contains your Dataflow pipeline workers. This mapping lets Dataflow automatically allocate Cloud NAT IP addresses to the VMs in the subnet. The mapping also enables Private Google Access.

For more information about using private networking for the Dataflow VMs, see Private Google Access interaction.

Store the Datadog API key in Secret Manager

This reference architecture uses Secret Manager to store the Datadog API key value. Datadog API keys are unique to your organization. You use them to authenticate to the Datadog API endpoints. Any credentials that you store in Secret Manager are encrypted by default. Secret Manager also provides access control through IAM and observability through audit logging.

Secret Manager also supports versioning. That means that you can maintain a policy of short-lived credentials by rotating your API key value whenever it's appropriate.

Each time the Datadog API key value is updated, Secret Manager creates a new version of the Secret Manager secret.

For more information about rotating your API key values, see About rotation schedules.

Create a custom service account for Dataflow pipeline workers

By default, the service account that's used by worker VMs in the Dataflow pipeline is the Compute Engine default service account. This service account provides broad access to resources in your project. To follow the principle of least privilege, you should create a custom service account with the minimum required permissions.

Successfully running a Dataflow job requires that you grant the Dataflow worker service account the minimum set of roles that the job needs, such as the Dataflow Worker role (roles/dataflow.worker). For the full list of required roles, see the Dataflow security and permissions documentation.

Use Datadog to maintain data sovereignty

To maintain data sovereignty, Datadog offers unique sites that are distributed throughout the world. Data is never shared between sites. Each site operates independently. Use different sites for specific use cases (such as government security regulations) or to store your data in different regions.

Reliability

This section describes features and design considerations for reliability.

Intake errors

The Datadog API documentation lists the potential errors that you might encounter at intake. The following table briefly describes Datadog status codes, causes, and which error types are automatically retried. For example, 4xx errors aren't automatically retried and 5xx errors are automatically retried.

Status code | Cause | Automatically retried?
----------- | ----- | ----------------------
400 | Bad request (likely an issue in the payload formatting) | No
401 | Unauthorized (likely a missing API key) | No
403 | Permission issue (likely an invalid API key) | No
408 | Request timeout | No
413 | Payload too large | No
429 | Too many requests | No
500 | Internal server error | Yes
503 | Service unavailable | Yes

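The retry behavior in the table reduces to a simple rule that can be captured in a small helper. The function name below is ours, not part of the template:

```javascript
// Classify a Datadog intake status code per the table above:
// 5xx errors are retried automatically, 4xx errors are not.
function isAutomaticallyRetried(statusCode) {
  return statusCode >= 500 && statusCode < 600;
}

console.log(isAutomaticallyRetried(429)); // false
console.log(isAutomaticallyRetried(503)); // true
```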
Due to API rate limits at intake, rapid and unexpected bursts in platform logs can lead to throttling and 429 errors. To help prevent these errors, configure the Pub/Sub to Datadog Dataflow template so that the Dataflow worker pipelines batch up to 1,000 logs in each request. The template sends batches at least every two seconds, which keeps the flow of logs timely and consistent during slow periods.

Index daily quotas

If you are receiving errors with a 429 status code (too many requests), you might have reached the maximum daily quota for a Datadog log index. By default, Datadog log indexes can receive up to 200,000,000 log events per day before being rate-limited.

Increase your daily quota by directly editing your index in the Datadog user interface or through the Datadog API. You can also set up multiple indexes. Configure each index to have a different retention period. You can also create different queries for each index.

For more information, see Best Practices for Log Management.

Logs silently dropped at intake

Sometimes Datadog drops Google Cloud log files at intake without generating an error status code.

For more information about potential causes, as well as how to use Datadog metrics to determine if you're affected by this issue, see Unexpectedly dropping logs.

Log event tags

A log event shouldn't have more than 100 tags. Each tag shouldn't exceed 200 characters. Tags can include Unicode characters. Datadog supports a maximum of 10,000,000 unique tags per day.
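A pre-send check against these limits can catch problems before intake. The following sketch is an illustration of the documented limits, not a Datadog API; note that `length` counts UTF-16 code units, so Unicode tags near the limit deserve extra care:

```javascript
// Check a log event's tags against the documented limits:
// at most 100 tags, each at most 200 characters.
function validateTags(tags) {
  const problems = [];
  if (tags.length > 100) problems.push(`too many tags: ${tags.length}`);
  for (const tag of tags) {
    if (tag.length > 200) problems.push(`tag too long: ${tag.slice(0, 20)}...`);
  }
  return problems;
}

console.log(validateTags(['env:prod', 'team:platform'])); // []
console.log(validateTags(['x'.repeat(201)]).length);      // 1
```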

For more information, see Getting Started with Tags.

Log event attributes

Any log event that's converted to JSON format should contain fewer than 256 attributes. Each attribute key should be fewer than 50 characters long and nested in fewer than 10 successive levels. If you intend to promote attributes as log facets, the attributes should have fewer than 1,024 characters.
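These limits can also be checked mechanically before sending. The following is a minimal sketch (the function name and the sample event are ours) that walks a JSON log event and collects the attribute count, maximum nesting depth, and any over-long keys:

```javascript
// Check a JSON log event against the documented attribute limits:
// fewer than 256 attributes, keys under 50 characters, and nesting
// under 10 successive levels. Walks the object recursively.
function checkAttributes(obj, depth = 1, stats = { count: 0, maxDepth: 1, longKeys: [] }) {
  stats.maxDepth = Math.max(stats.maxDepth, depth);
  for (const [key, value] of Object.entries(obj)) {
    stats.count += 1;
    if (key.length >= 50) stats.longKeys.push(key);
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      checkAttributes(value, depth + 1, stats);
    }
  }
  return stats;
}

// Hypothetical log event with one nested level.
const event = { http: { method: 'GET', status_code: 200 }, service: 'web' };
const stats = checkAttributes(event);
console.log(stats.count, stats.maxDepth); // 4 2
```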

For more information, see Attributes and Aliasing.

Maximum log file sizes

To learn about the limitations on single log file size, uncompressed payload size, and number of logs grouped together that the Datadog API can accept, see the Datadog Logs API reference.

Operational efficiency

The following sections describe the operational efficiency considerations for this reference architecture.

Overwrite default log attributes with user-defined functions

You can use Datadog Log Management processors to transform and enrich Google Cloud log files after Datadog receives them. However, an alternative transformation option is to extend the Pub/Sub to Datadog template by writing a user-defined function (UDF) in JavaScript.

The UDF can override certain default log attributes such as host or service. For more information, see the User-defined function parameter in the Pub/Sub to Datadog template.
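A minimal sketch of such a UDF follows. Template UDFs receive each message as a JSON string and return the transformed JSON string; the function name `process` is a common convention for these templates, and the values assigned below are illustrative, not defaults:

```javascript
// Example UDF for the Pub/Sub to Datadog template: receives one
// message as a JSON string and returns the transformed JSON string.
// The attribute values set here are illustrative assumptions.
function process(inJson) {
  const event = JSON.parse(inJson);
  // Override default log attributes before Datadog receives the event.
  event.host = event.host || 'gcp-dataflow-forwarder';
  event.service = 'my-gcp-service';
  return JSON.stringify(event);
}

console.log(process('{"message":"hello","host":"vm-1"}'));
// {"message":"hello","host":"vm-1","service":"my-gcp-service"}
```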

Redeliver failed messages

To prevent data loss, messages that are sent but that aren't delivered to Datadog are sent to the dead-letter topic. This can happen because of 4xx (client) or 5xx (server) errors.

When a server error (5xx) occurs, delivery is retried with exponential backoff. The maximum backoff is 15 minutes. If the message isn't successfully delivered in this timeframe, it's sent to the dead-letter topic.
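The delay sequence produced by such a policy can be sketched as follows. The base delay and doubling factor here are assumptions for illustration, not the template's exact values; only the 15-minute (900-second) cap comes from the description above:

```javascript
// Sketch of exponential backoff capped at 15 minutes (900 seconds).
// The base delay and factor are illustrative assumptions.
function backoffDelays(baseSeconds, maxSeconds, attempts) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseSeconds * 2 ** i, maxSeconds));
  }
  return delays;
}

console.log(backoffDelays(5, 900, 10));
// [ 5, 10, 20, 40, 80, 160, 320, 640, 900, 900 ]
```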

The Datadog Logs API accepts log events with timestamps up to 18 hours in the past. Ensure that you resend any log messages in the dead-letter topic within this timeframe so that they are accepted by the Datadog API.
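Before resending dead-lettered messages, it can help to verify that their timestamps are still inside the 18-hour window. A minimal check (the function name is ours):

```javascript
// Check whether a dead-lettered log's timestamp is still within the
// 18-hour window that the Datadog Logs API accepts.
const MAX_AGE_MS = 18 * 60 * 60 * 1000;

function isAcceptedByIntake(logTimestampIso, nowMs = Date.now()) {
  const age = nowMs - Date.parse(logTimestampIso);
  return age >= 0 && age <= MAX_AGE_MS;
}

const now = Date.parse('2024-12-10T18:00:00Z');
console.log(isAcceptedByIntake('2024-12-10T01:00:00Z', now)); // true  (17 hours old)
console.log(isAcceptedByIntake('2024-12-09T23:00:00Z', now)); // false (19 hours old)
```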

For failed log messages, use the following process to troubleshoot and then redeliver the logs to Datadog:

  1. Inspect the logs and resolve the issues that prevented delivery. For example:
    • For 401 (unauthorized) or 403 (permission issue) errors, confirm that the Datadog API key is valid and that the Dataflow job has access to it:
      • Check the API key validity in Datadog.
      • Check that the Secret Manager secret that contains your valid Datadog API key allows access from the correct service account.
    • For other errors, review the possible causes. Some errors are caused by restrictions on the Datadog logging endpoint. For more information, see the custom log forwarding section of Datadog's log collection documentation.
  2. Create a temporary Dataflow job with the Pub/Sub to Pub/Sub template. Use this job to route the undelivered messages back into the input topic of the primary log-forwarding pipeline.
  3. Confirm that all failed messages in the dead-letter topic have been sent back to the input topic of the primary log-forwarding pipeline.

Performance and cost optimization

The following sections describe the factors that can influence the network and cost efficiency of this reference architecture.

Batch count

For optimal efficiency of network egress traffic, and its associated cost savings, Datadog recommends that you configure your batchCount parameter to the maximum setting of 1,000. This maximum parameter value means that up to 1,000 messages are batched together in a single network request. A batch is sent at least every two seconds, regardless of the batch size.

  • The minimum value for batchCount is 10.
  • The default value for batchCount is 100.

To provide near real-time viewing of Google Cloud logs, Datadog sets the delay between batches to two seconds, regardless of whether the batchCount value has been reached. For example, if your batchCount value is set to 1,000, you continue to receive logs at least every two seconds—even during periods of sparse log-file generation.
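The interaction between batchCount and the two-second flush interval can be sketched as follows. The class and method names are ours, for illustration only; the real template handles this inside its Dataflow workers:

```javascript
// Sketch of the batching behavior described above: flush when
// batchCount messages accumulate, or when the flush interval
// elapses, whichever comes first.
class LogBatcher {
  constructor(batchCount = 1000, flushIntervalMs = 2000) {
    this.batchCount = batchCount;
    this.flushIntervalMs = flushIntervalMs;
    this.buffer = [];
    this.lastFlushMs = 0;
  }

  // Returns a full batch when batchCount is reached, else null.
  add(log) {
    this.buffer.push(log);
    if (this.buffer.length >= this.batchCount) return this.flush();
    return null;
  }

  // Returns a (possibly small) batch when the interval has elapsed.
  maybeFlush(nowMs) {
    if (nowMs - this.lastFlushMs >= this.flushIntervalMs && this.buffer.length > 0) {
      this.lastFlushMs = nowMs;
      return this.flush();
    }
    return null;
  }

  flush() {
    const batch = this.buffer;
    this.buffer = [];
    return batch;
  }
}

const b = new LogBatcher(3, 2000);
console.log(b.add('a'), b.add('b')); // null null
console.log(b.add('c'));             // [ 'a', 'b', 'c' ]
b.add('d');
console.log(b.maybeFlush(2500));     // [ 'd' ]
```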

Parallelism

To increase the number of requests that are sent to Datadog in parallel, use the parallelism parameter in the Pub/Sub to Datadog template. By default, this value is set to 1, which disables parallelism. There is no defined upper limit for parallelism.

For more information about parallelism, see Pipeline lifecycle.

Optimize compute resources with Dataflow Prime

You can incorporate Dataflow Prime with the Pub/Sub to Datadog template. Dataflow Prime is a serverless data-processing platform based on Dataflow. Use one or both of the following parameters to optimize the compute resources that your Dataflow pipeline uses:

  • Vertical autoscaling: To meet the requirements of your pipeline, Dataflow Prime automatically scales the memory capacity of your Dataflow worker VMs. Scaling memory capacity can be useful for bursty workloads that trigger out-of-memory issues. For more information, see Vertical Autoscaling.
  • Right fitting: To help you specify stage-specific or pipeline-wide compute resources, Dataflow Prime creates stage-specific worker pools, with resource hints, for each stage in the Dataflow pipeline. For more information, see Right fitting.

Dataflow pipeline options

Other factors that can influence the performance and cost of your Dataflow pipeline are documented on the Pipeline options page, such as the maximum number of workers and the worker machine type.

If needed, use these settings to further refine the performance and cost efficiency of your Dataflow pipeline.

Deployment

To deploy this architecture, see Deploy Log Streaming from Google Cloud to Datadog.

Contributors

Authors:

  • Ashraf Hanafy | Senior Software Engineer for Google Cloud Integrations, Datadog
  • Daniel Trujillo | Engineering Manager, Google Cloud Integrations, Datadog
  • Bryce Eadie | Technical Writer, Datadog
  • Sriram Raman | Senior Product Manager, Google Cloud Integrations, Datadog

Other contributors: