This document discusses monitoring and logging architectures for hybrid and multi-cloud deployments, and provides best practices for implementing them by using Google Cloud. With this document, you can identify which patterns and products are best suited for your environments.
Every enterprise has a unique portfolio of application workloads that place requirements and constraints on the architecture of a hybrid or multi-cloud setup. Although you must design and tailor your architecture to meet these constraints and requirements, you can rely on some common patterns.
The patterns covered in this document fall into two categories:
- In a single pane of glass architecture, all monitoring and logging is centralized, with the aim of providing a single point of access and control.
- In a separate application and operations architecture, sensitive application data is segregated from less sensitive operations data, with the aim of meeting compliance requirements for sensitive data.
Choosing your architecture pattern
You can use the decision tree in the following diagram to identify the best architecture for your use case.
Details of each architecture are discussed further in this document, but at a high level, your choices are as follows:
- Export from Monitoring to legacy solution.
- Export directly to legacy solution.
- Use Monitoring with Prometheus and Fluentd.
- Use Monitoring with Blue Medora BindPlane.
Single pane of glass architecture
A common goal for a hybrid system is to integrate monitoring and logging information from various sources across multiple applications and environments into a single display. This type of display is called a single pane of glass.
The following diagram illustrates this pattern, in which monitoring and logging data from all applications, both on-premises and in the cloud, is centralized into a single repository hosted in the cloud.
This architecture has the following advantages:
- You have a single, consistent view for all monitoring and logging.
- You have a single place to manage data storage and retention.
- You get centralized access control and auditing. However, you still need to ensure the security of data in transit to the central repository.
Monitoring as a single pane of glass
Cloud Monitoring is a Google-managed monitoring and management solution for services, containers, applications, and infrastructure. It is part of the Google Cloud operations suite, which provides a single pane of glass and a robust storage solution for metrics, logs, traces, and events. The suite also provides a complete set of observability tooling, such as dashboards, reporting, and alerting.
All Google Cloud products and services support integration with Monitoring. In addition, there are several integrated tools that you can use to extend Monitoring to hybrid and on-premises resources.
The following best practices apply to all architectures using Monitoring as a single pane of glass:
- To fulfill compliance requirements for log retention, set up log sinks for your organization.
- To analyze log events quickly, for example for security and access analytics, set up log exports to BigQuery.
- For projects containing sensitive data, consider enabling Data Access audit logs, so you can track who has accessed the data.
- To remove sensitive information, such as Social Security numbers, credit card numbers, and email addresses, you can filter log data. You can filter for collection with a custom Fluentd configuration, or for ingest with logs exclusions. You can export unfiltered logs separately to meet compliance requirements.
- To optimize your costs, analyze your Monitoring and Logging usage and implement cost controls. Monitoring and Logging are managed services with volume-based charges for logs and metrics.
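The filtering practice above can be illustrated with a minimal Python sketch that redacts sensitive values from a log message before it is forwarded. This is not the Fluentd configuration or the logs exclusion feature itself. The `PATTERNS` table and the `redact` function are hypothetical names, and the regular expressions are deliberately simple assumptions that would need hardening for production data.

```python
import re

# Illustrative patterns only (assumptions, not a complete or locale-aware set).
PATTERNS = {
    # US Social Security number in the form 123-45-6789.
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    # Simplified email address shape.
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(message: str) -> str:
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        message = pattern.sub(f"[REDACTED_{label.upper()}]", message)
    return message


line = "user alice@example.com paid with 4111 1111 1111 1111, ssn 123-45-6789"
print(redact(line))
```

Redacting at collection time keeps the sensitive values from ever leaving the source environment, which is why the unfiltered copy, if you need one for compliance, has to be exported through a separate path.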
Hybrid monitoring and logging with Monitoring and Blue Medora BindPlane
With Google's partner Blue Medora BindPlane, you can import monitoring and logging data into Monitoring from both on-premises VMs and other cloud providers, such as Amazon Web Services (AWS), Microsoft Azure, Alibaba Cloud, and IBM Cloud. The following diagram shows how Monitoring and BindPlane can provide a single pane of glass for a hybrid cloud.
This architecture has the following advantages:
- In addition to monitoring resources like VMs, Blue Medora has built-in deep integration for over 150 popular data sources.
- There are no additional licensing costs for using BindPlane. BindPlane metrics are imported into Monitoring as custom metrics, which are charged at standard rates. Likewise, BindPlane logs are charged at the same rate as other Logging logs.
For more details about implementing this pattern, see Logging and monitoring on-premises resources with Blue Medora.
Hybrid Google Kubernetes Engine monitoring with Prometheus and Monitoring
With Prometheus, a popular open source monitoring solution, and the Monitoring Prometheus sidecar, you can monitor applications running on multiple Kubernetes clusters with Monitoring. This architecture is useful when running Kubernetes workloads distributed across Google Kubernetes Engine (GKE) on Google Cloud and Anthos clusters on VMware in your on-premises data center, because it provides a unified interface across both. The following diagram shows how to use Prometheus and the Monitoring sidecar for data collection.
This architecture has the following advantages:
- You get consistent Kubernetes metrics across cloud and on-premises environments.
- There are no additional licensing costs for using Prometheus. Prometheus metrics are imported into Monitoring as custom metrics, which are charged at standard rates.
This architecture has the following disadvantages:
- The Monitoring Prometheus sidecar is supported only in GKE environments.
- Prometheus supports monitoring only, so logging has to be configured separately. The following section discusses a common option for logging, Fluentd.
We recommend the following best practice:
- By default, Prometheus collects all exposed metrics, each of which becomes a chargeable custom metric. To avoid unexpected costs, implement Monitoring cost controls and consider adding filters in the Monitoring Prometheus sidecar to limit ingestion.
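To see why filtering matters for cost, the following Python sketch models the effect of an allowlist on the set of exposed metrics. It does not use the sidecar's actual filter syntax, which you should take from the sidecar documentation. The `select_metrics` function and the glob-style patterns are assumptions made for illustration only.

```python
from fnmatch import fnmatchcase


def select_metrics(exposed, allow_patterns):
    """Return the exposed metric names that match at least one allowlist pattern."""
    return [
        name
        for name in exposed
        if any(fnmatchcase(name, pattern) for pattern in allow_patterns)
    ]


# Example metric names as an application might expose them.
exposed = [
    "http_requests_total",
    "http_request_duration_seconds",
    "go_gc_duration_seconds",
    "process_cpu_seconds_total",
]

# Keep only the HTTP metrics; everything else would not be ingested.
kept = select_metrics(exposed, ["http_*"])
print(f"{len(kept)} of {len(exposed)} metrics would be ingested: {kept}")
```

Because each ingested metric becomes a chargeable custom metric, starting from an explicit allowlist and widening it as needed tends to be safer than ingesting everything and pruning later.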
For more details about implementing this pattern, see Monitoring apps running on multiple GKE clusters using Prometheus and Monitoring.
Hybrid GKE logging with Fluentd and Logging
With Fluentd, a popular open source logging agent, and Cloud Logging, you can ingest logs from applications running on multiple GKE clusters to Cloud Logging. This architecture is useful when running Kubernetes workloads distributed across GKE on Google Cloud and Anthos clusters on VMware in your on-premises data center, because it provides a unified interface across both. The following diagram illustrates the flow of logs.
This architecture has the following advantages:
- You can have consistent Kubernetes logging across cloud and on-premises environments.
- You can customize Logging to filter out sensitive information.
- There are no additional licensing costs for using Fluentd. Fluentd logs imported into Logging are charged at standard rates.
This architecture has the following disadvantages:
- Fluentd supports logging only, so monitoring has to be configured separately. The previous section discusses a common option for monitoring with Prometheus.
For more details about implementing this pattern, see Customizing Logging logs for Google Kubernetes Engine with Fluentd.
Partner services as single panes of glass
If you are already using a third-party monitoring or logging service such as Datadog or Splunk, you might not want to move to Monitoring and Logging. If so, you can export data from Google Cloud to many common monitoring and logging services. You can choose to use an integrated monitoring and logging service, or select separate monitoring and logging services that best fit your needs.
Export from Monitoring and Logging to partner services
In this pattern, you authorize the partner's monitoring service, such as Datadog, to connect to the Cloud Monitoring API. This authorization lets the service ingest all the metrics available to Monitoring, so Datadog can function as a single pane of glass for monitoring.
For logging data, Logging provides exports (log sinks) to Pub/Sub. These exports provide a performant and resilient method for partner logging services such as Elastic and Splunk to ingest large volumes of logs from Logging in real time, so these partner services can serve as a single pane of glass for logs.
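As a rough illustration of the consuming side of a Pub/Sub export, the following Python sketch decodes a log entry from a push-style Pub/Sub message. It assumes that the message `data` field carries a base64-encoded JSON log entry with fields such as `severity`, `timestamp`, and `textPayload`. The function names, project name, subscription name, and sample entry are hypothetical, and a real partner integration handles delivery, acknowledgment, and retries for you.

```python
import base64
import json


def decode_log_entry(push_body: dict) -> dict:
    """Decode the JSON log entry carried in a Pub/Sub push envelope."""
    encoded = push_body["message"]["data"]
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


def summarize(entry: dict) -> str:
    """Build a one-line summary from a decoded log entry."""
    if "textPayload" in entry:
        payload = entry["textPayload"]
    else:
        payload = json.dumps(entry.get("jsonPayload", {}), sort_keys=True)
    severity = entry.get("severity", "DEFAULT")
    return f"{entry.get('timestamp', '?')} {severity} {payload}"


# Hypothetical sample data, built here so the sketch runs without any cloud access.
sample_entry = {
    "logName": "projects/example-project/logs/syslog",
    "severity": "ERROR",
    "timestamp": "2024-01-01T00:00:00Z",
    "textPayload": "disk quota exceeded",
}
sample_body = {
    "message": {
        "data": base64.b64encode(json.dumps(sample_entry).encode("utf-8")).decode("ascii"),
        "messageId": "1",
    },
    "subscription": "projects/example-project/subscriptions/example-sub",
}

print(summarize(decode_log_entry(sample_body)))
```

Pub/Sub sits between Logging and the partner service as a buffer, so a consumer like this one can fall behind temporarily without losing log entries.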
The combined architecture for logging and monitoring is shown in the following diagram.
This architecture has the following advantages:
- You can continue to use familiar existing tools.
- Google Cloud Support continues to have access to Logging logs for troubleshooting.
This architecture has the following disadvantages:
- Partner solutions are typically externally hosted, which means they might be unavailable or unable to collect data if network connections are disrupted. In some cases, you can mitigate this risk by self-hosting, but at the cost of maintaining the infrastructure for the solution yourself.
- Externally hosted dashboards aren't directly available to Google Cloud Support. This lack of availability can slow down troubleshooting and mitigation.
- Commercial partner solutions might entail additional licensing fees.
Some detailed example integrations include the following:
- Datadog: Monitoring Compute Engine metrics and Collect Logging Logs
- Elastic: Exporting Logging logs to Elastic Cloud
- Splunk: Scenarios for exporting Logging
Export metrics from Prometheus and Monitoring to Grafana
Grafana is a popular open source monitoring tool commonly paired with Prometheus for metrics collection. In this architecture, you use Prometheus as the on-premises collection layer and use Grafana as a single pane of glass for both Google Cloud and on-premises resources. The following diagram shows a sample architecture that exports metrics from both Google Cloud and on-premises resources.
This architecture has the following advantages:
- It's suitable for hybrid environments with both VMs and containers.
- If your organization is already using Prometheus and Grafana, your users can continue to use them.
This architecture has the following disadvantages:
- Prometheus and Grafana support monitoring only, so logging has to be configured separately, for example, using Fluentd.
- Prometheus is open source and extensible, but supports only a limited range of enterprise software integrations.
- Prometheus and Grafana are third-party tools and not official Google products. Google doesn't offer support for Prometheus or Grafana.
For more information, see Introducing Logging as a data source for Grafana. For a detailed tutorial on deploying Prometheus and Grafana with GKE and Logging, see Using Prometheus and Grafana for IoT monitoring.
Export logs using Fluentd
An earlier pattern covered using Fluentd as a log collector for Logging. The same basic architecture can also be used for other logging or data analytics systems that support Fluentd, including BigQuery, Elastic, and Splunk. The following diagram illustrates this pattern.
This architecture has the following advantages:
- It's suitable for hybrid environments with both VMs and containers.
- Fluentd can read from many data sources, including system logs.
- Fluentd offers output plugins for many popular third-party logging and data analytics systems.
This architecture has the following disadvantages:
- Fluentd supports logs only, so monitoring has to be configured separately. The previous section discusses common options for monitoring with Prometheus and Grafana.
- Fluentd is a third-party tool and not an official Google product. Google doesn't offer support for Fluentd.
- Exported logs are not available to Google Cloud Support for troubleshooting. In particular, Google does not offer support for Anthos clusters on VMware without Logging enabled.
For an example of integrating Fluentd with BigQuery, see Analyzing logs in real time using Fluentd and BigQuery.
Separate application and operations data
Single pane of glass architectures require streaming application monitoring and logging data to the cloud. However, you might have regulatory or compliance requirements that either require keeping customer data on-premises or place strict constraints on what data can be stored in the public cloud.
A useful pattern for these hybrid environments is to separate sensitive application data from lower-risk operations data, as illustrated in the following diagram.
Separate application and system data with Anthos
Anthos clusters on VMware, a part of the Anthos suite, includes Grafana for monitoring on-premises clusters. In addition, you can opt to install a partner solution such as Elastic Stack or Splunk for logging. Using these solutions, you can ingest and view sensitive application data entirely on-premises, while still exporting system data to Logging on Google Cloud. The following diagram illustrates this architecture.
This architecture has the following advantages:
- Sensitive application data is kept entirely on-premises.
- On-premises monitoring and logging have no cloud dependencies and remain available even if the network connection is interrupted.
- All GKE system data, both on-premises and on Google Cloud, is centralized in Logging and is also accessible to Google Cloud Support as needed.
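The separation described above comes down to a routing decision for each log record. The following Python sketch shows one way to express that decision, assuming that system components can be identified by their Kubernetes namespace. The `SYSTEM_NAMESPACES` set, the `LogRecord` type, and the `route` function are hypothetical and are not part of Anthos or Logging.

```python
from dataclasses import dataclass

# Assumption: namespaces that hold only system components (hypothetical list).
SYSTEM_NAMESPACES = {"kube-system", "gke-system"}


@dataclass
class LogRecord:
    namespace: str
    message: str


def route(record: LogRecord) -> str:
    """Send system data to the cloud and keep everything else on-premises."""
    if record.namespace in SYSTEM_NAMESPACES:
        return "cloud"
    return "on_premises"


records = [
    LogRecord("kube-system", "kubelet started"),
    LogRecord("payments", "card authorization for customer 1234"),
]
for record in records:
    print(f"{record.namespace}: {route(record)}")
```

The sketch defaults to keeping data on-premises, so a record from an unrecognized namespace is never exported by accident. Only namespaces that are explicitly listed as system namespaces are sent to the cloud.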
For an example implementation using Elastic Stack as the partner solution, see Monitoring Anthos with the Elastic Stack.
What's next
- Learn more about hybrid and multi-cloud best practices with the Hybrid and multi-cloud patterns and practices series, including architecture patterns and network topologies.
- Enroll in the Cloud Kubernetes Best Practice quest for hands-on exercises about observability and more on GKE.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.