Design a self-service data platform for a data mesh

Last reviewed 2022-10-06 UTC

In a data mesh, a self-service data platform lets users generate value from data by autonomously building, sharing, and using data products. To fully realize these benefits, we recommend that your self-service data platform provide the capabilities described in this document.

This document is part of a series that describes how to implement a data mesh on Google Cloud. It assumes that you have read and are familiar with the concepts described in Build a modern, distributed Data Mesh with Google Cloud and Architecture and functions in a data mesh.

The series has the following parts:

  • Architecture and functions in a data mesh
  • Design a self-service data platform for a data mesh (this document)
  • Build data products in a data mesh
  • Discover and consume data products in a data mesh

A central data platform team typically creates the self-service data platform described in this document. This team builds the solutions and components that domain teams (both data producers and data consumers) can use to create and consume data products. Domain teams represent functional parts of a data mesh. By building these components, the data platform team enables a smooth development experience and reduces the complexity of building, deploying, and maintaining data products that are secure and interoperable.

Ultimately, the data platform team should help domain teams move faster. The team increases the efficiency of domain teams by providing them with a focused set of tools that addresses their needs, which removes the burden of each domain team building and sourcing those tools itself. The tooling should be customizable to different needs and shouldn't force an inflexible way of working on the data domain teams.

The data platform team shouldn't focus on building custom solutions for data pipeline orchestrators or for continuous integration and continuous deployment (CI/CD) systems. Solutions such as CI/CD systems are readily available as managed cloud services, for example, Cloud Build. Using managed cloud services reduces operational overhead for the data platform team and lets the team focus on the specific needs of the data domain teams as the users of the platform.

Architecture

The following diagram illustrates the architecture components of a self-service data platform. The diagram also shows how these components can support teams as they develop and consume data products across the data mesh.

Self-service data platform components, as described in the following text.

As shown in the preceding diagram, the self-service data platform provides the following:

  • Platform solutions: These solutions consist of composable components for provisioning Google Cloud projects and resources, which users select and assemble in different combinations to meet their specific requirements. Instead of interacting with the components directly, users of the platform can interact with platform solutions that help them achieve a specific goal. The data platform team should design platform solutions to solve common pain points and friction areas that slow down data product development and consumption. For example, data domain teams onboarding onto the data mesh can use an infrastructure-as-code (IaC) template. Using IaC templates lets them quickly create a set of Google Cloud projects with standard Identity and Access Management (IAM) permissions, networking, security policies, and relevant Google Cloud APIs enabled for data product development. We recommend that each solution is accompanied by documentation, such as "how to get started" guidance and code samples. Data platform solutions and their components must be secure and compliant by default.

  • Common services: These services provide data product discoverability, management, sharing, and observability. These services facilitate data consumers' trust in data products, and are an effective way for data producers to alert data consumers to issues with their data products.

Data platform solutions and common services might include the following:

  • IaC templates to set up foundational data product development workspace environments (a minimal sketch of such a template follows this list), which include the following:
    • IAM
    • Logging and monitoring
    • Networking
    • Security and compliance guardrails
    • Resource tagging for billing attribution
    • Data product storage, transformation, and publishing
    • Data product registration, cataloging, and metadata tagging
  • IaC templates that follow organizational security guardrails and best practices and that can be used to deploy Google Cloud resources into existing data product development workspaces.
  • Application and data pipeline templates that can be used to bootstrap new projects or used as reference for existing projects. Examples of such templates include the following:
    • Usage of common libraries and frameworks
    • Integration with platform logging, monitoring, and observability tooling
    • Build and test tooling
    • Configuration management
    • Packaging and CI/CD pipelines for deployment
    • Authentication, deployment, and management of credentials
  • Common services to provide data product observability and governance, which can include the following:
    • Uptime checks to show the overall state of data products.
    • Custom metrics to give helpful indicators about data products.
    • Operational support by the central team, so that data consumer teams are alerted to changes in the data products they use.
    • Product scorecards to show how data products are performing.
    • A metadata catalog for discovering data products.
    • A centrally defined set of computational policies that can be applied globally across the data mesh.
    • A data marketplace to facilitate data sharing across domain teams.
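
For example, a foundational workspace template like the one described in the first item in the preceding list might provision a domain project, enable the APIs needed for data product development, and grant the domain team standard access. The following is a minimal Terraform sketch; the project ID, folder, labels, and group name are illustrative assumptions, not prescribed values:

# Minimal sketch of a workspace bootstrap template (illustrative names and IDs).
resource "google_project" "data_product_workspace" {
  name            = "sales-data-product-dev"
  project_id      = "sales-data-product-dev"        # hypothetical project ID
  folder_id       = "123456789012"                  # hypothetical domain folder
  billing_account = "000000-AAAAAA-BBBBBB"          # hypothetical billing account
  labels = {
    domain      = "sales"                           # used for billing attribution
    cost_center = "cc-1234"
  }
}

# Enable the Google Cloud APIs that data product development relies on.
resource "google_project_service" "services" {
  for_each = toset([
    "bigquery.googleapis.com",
    "storage.googleapis.com",
    "dataplex.googleapis.com",
  ])
  project = google_project.data_product_workspace.project_id
  service = each.value
}

# Grant the domain team a standard, least-privilege role in the workspace.
resource "google_project_iam_member" "domain_team" {
  project = google_project.data_product_workspace.project_id
  role    = "roles/bigquery.dataEditor"
  member  = "group:sales-data-team@mycompany.com"   # hypothetical group
}

A complete template would also wire in networking, logging and monitoring, and security guardrails, as listed previously.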

Create platform components and solutions using IaC templates discusses the advantages of using IaC templates to expose and deploy data products. Provide common services discusses why it's helpful to provide domain teams with common infrastructure components that are built and managed by the data platform team.

Create platform components and solutions using IaC templates

The goal of data platform teams is to set up self-service data platforms to get more value from data. To build these platforms, they create and provide domain teams with vetted, secure, and self-serviceable infrastructure templates. Domain teams use these templates to deploy their data development and data consumption environments. IaC templates help data platform teams achieve that goal and enable scale. Using vetted and trusted IaC templates simplifies the resource deployment process for domain teams by allowing those teams to reuse existing CI/CD pipelines. This approach lets domain teams quickly get started and become productive within the data mesh.

IaC templates can be created using an IaC tool. Although there are multiple IaC tools, including Config Connector, Pulumi, Chef, and Ansible, this document provides Terraform-based examples. Terraform is an open source IaC tool that lets the data platform team efficiently create composable platform components and solutions for Google Cloud resources. Using Terraform, the data platform team writes code that specifies the desired end state and lets the tool figure out how to achieve that state. This declarative approach lets the data platform team treat infrastructure resources as immutable artifacts for deployment across environments. It also helps to reduce the risk of inconsistencies arising between deployed resources and the declared code in source control (referred to as configuration drift). Configuration drift caused by ad hoc and manual changes to infrastructure hinders safe and repeatable deployment of IaC components into production environments.

Common IaC templates for composable platform components include Terraform modules for deploying resources such as a BigQuery dataset, a Cloud Storage bucket, or a Cloud SQL database. Terraform modules can be combined into end-to-end solutions that deploy complete Google Cloud projects, including the relevant resources deployed through the composable modules. Example Terraform modules can be found in the Terraform blueprints for Google Cloud.
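
As an illustration, the body of one such composable component might wrap a BigQuery dataset and a Cloud Storage bucket for a single data product behind a small set of input variables. The following minimal sketch uses the Google provider directly; the variable names, location, and labels are illustrative assumptions:

variable "project_id" {
  type = string
}

variable "data_product" {
  type = string
}

# Dataset for publishing the data product, with organization defaults applied.
resource "google_bigquery_dataset" "product" {
  project    = var.project_id
  dataset_id = replace(var.data_product, "-", "_")
  location   = "EU"                                  # organization-standard location
  labels     = { data_product = var.data_product }
}

# Staging bucket for the data product, with a security guardrail applied by default.
resource "google_storage_bucket" "staging" {
  project                     = var.project_id
  name                        = "${var.project_id}-${var.data_product}-staging"
  location                    = "EU"
  uniform_bucket_level_access = true
}

An end-to-end solution can then call modules like this one, together with a project-creation module, to assemble a complete data product workspace.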

Each Terraform module should by default satisfy the security guardrails and compliance policies that your organization uses. These guardrails and policies can also be expressed as code and automatically verified using compliance verification tooling such as the Google Cloud policy validation tool.

Your organization should continuously test the platform-provided Terraform modules, using the same automated compliance guardrails that it uses to promote changes into production.

To make IaC components and solutions discoverable and consumable for domain teams that have minimal experience with Terraform, we recommend that you use services such as Service Catalog. Users who have significant customization requirements should be allowed to create their own deployment solutions from the same composable Terraform templates used by existing solutions.

When using Terraform, we recommend that you follow the Google Cloud best practices outlined in Best practices for using Terraform.

To illustrate how Terraform can be used to create platform components, the following sections discuss examples of how Terraform can be used to expose consumption interfaces and to consume a data product.

Expose a consumption interface

A consumption interface for a data product is a set of guarantees on data quality and operational parameters that the data domain team provides to enable other teams to discover and use its data products. Each consumption interface also includes a product support model and product documentation. A data product might have different types of consumption interfaces, such as APIs or streams, as described in Build data products in a data mesh. The most common consumption interface is likely to be a BigQuery authorized dataset, authorized view, or authorized function. This interface exposes a read-only virtual table, which is expressed as a query into the data mesh. The interface doesn't grant readers permission to access the underlying data directly.

Google provides an example Terraform module for creating authorized views without granting teams permissions on the underlying source dataset. The following code from this Terraform module grants IAM roles on the source dataset identified by dataset_id and registers the view and dataset that are authorized to read from it:

module "add_authorization" {
  source = "terraform-google-modules/bigquery/google//modules/authorization"
  version = "~> 4.1"

  dataset_id = module.dataset.bigquery_dataset.dataset_id
  project_id = module.dataset.bigquery_dataset.project

  roles = [
    {
      role           = "roles/bigquery.dataEditor"
      group_by_email = "ops@mycompany.com"
    }
  ]

  authorized_views = [
    {
      project_id = "view_project"
      dataset_id = "view_dataset"
      table_id   = "view_id"
    }
  ]
  authorized_datasets = [
    {
      project_id = "auth_dataset_project"
      dataset_id = "auth_dataset"
    }
  ]
}

If you need to grant users access to multiple views, granting access to each authorized view individually can be time consuming and difficult to maintain. Instead of creating multiple authorized views, you can use an authorized dataset to automatically authorize any views created in it.
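
For example, an authorized dataset can be declared directly with the google_bigquery_dataset_access resource. The following minimal sketch (with illustrative project and dataset names) authorizes every view in view_dataset to read from source_dataset:

# Minimal sketch: authorize an entire dataset of views instead of individual views.
resource "google_bigquery_dataset_access" "authorized_dataset" {
  project    = "source_project"
  dataset_id = "source_dataset"

  dataset {
    dataset {
      project_id = "view_project"
      dataset_id = "view_dataset"
    }
    # Authorize the views in view_dataset to query the source dataset.
    target_types = ["VIEWS"]
  }
}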

Consume a data product

For most analytics use cases, consumption patterns are determined by the application in which the data is used. The main use of a centrally provided consumption environment is for data exploration before the data is used in the consuming application. As discussed in Discover and consume products in a data mesh, SQL is the most common method for querying data products. For this reason, the data platform should provide data consumers with a SQL-based application for data exploration.

Depending on the analytics use case, you may be able to use Terraform to deploy the consumption environment for data consumers. For instance, data science is a common use case for data consumers. You can use Terraform to deploy Vertex AI user-managed notebooks to be used as a data science development environment. From the data science notebooks, data consumers can use their credentials to log in to the data mesh to explore data to which they have access and develop ML models based on this data.
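
For example, a minimal Terraform sketch for such a notebook environment might look like the following; the instance name, project, zone, machine type, image family, and service account are illustrative assumptions:

# Minimal sketch of a Vertex AI user-managed notebooks instance for data exploration.
resource "google_notebooks_instance" "data_science" {
  name         = "sales-analytics-notebook"
  project      = "consumer-project"
  location     = "europe-west1-b"                    # zone for the instance
  machine_type = "n1-standard-4"

  vm_image {
    project      = "deeplearning-platform-release"
    image_family = "tf-latest-cpu"
  }

  no_public_ip    = true                             # keep the instance on internal IPs only
  service_account = "notebook-sa@consumer-project.iam.gserviceaccount.com"
}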

To learn how to use Terraform to deploy and help to secure a notebook environment on Google Cloud, see Protecting confidential data in Vertex AI Workbench user-managed notebooks.

Provide common services

In addition to self-service IaC components and solutions, the data platform team might also take ownership of building and operating common shared platform services used by multiple data domain teams. Common examples of shared platform services include self-hosted third-party software, such as business intelligence visualization tools, or a Kafka cluster. In Google Cloud, the data platform team might choose to manage resources such as Dataplex and Cloud Logging sinks on behalf of data domain teams. Managing resources for the data domain teams lets the data platform team facilitate centralized policy management and auditing across the organization.
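
For example, the data platform team might centrally manage an aggregated log sink that routes data access audit logs from all domain projects into a dataset that it owns. The following minimal sketch assumes an illustrative organization ID, filter, and destination:

# Minimal sketch of a centrally managed, organization-wide log sink.
resource "google_logging_organization_sink" "data_platform_audit" {
  name             = "data-platform-audit-sink"
  org_id           = "123456789012"                  # hypothetical organization ID
  include_children = true

  # Route data access audit logs to a central BigQuery dataset for auditing.
  destination = "bigquery.googleapis.com/projects/central-observability/datasets/audit_logs"
  filter      = "logName:\"cloudaudit.googleapis.com%2Fdata_access\""
}

The sink's writer identity must separately be granted write access to the destination dataset.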

The following sections show how to use Dataplex for central management and governance within a data mesh on Google Cloud, and how to implement data observability features in a data mesh.

Dataplex for data governance

Dataplex provides a data management platform that helps you to build independent data domains within a data mesh that spans the organization. Dataplex lets you maintain central controls for governing and monitoring the data across domains.

With Dataplex, an organization can logically organize its data (from supported data sources) and related artifacts, such as code, notebooks, and logs, into a Dataplex lake that represents a data domain. In the following diagram, a sales domain uses Dataplex to organize its assets, including data quality metrics and logs, into Dataplex zones.

Assets organized by Dataplex.

As shown in the preceding diagram, Dataplex can be used to manage domain data as follows:

  • Dataplex lets data domain teams consistently manage their data assets in a logical group called a Dataplex lake. The data domain team can organize their Dataplex assets within the same lake without physically moving data or storing it in a single storage system. Dataplex assets can refer to Cloud Storage buckets and BigQuery datasets stored in Google Cloud projects other than the project that contains the Dataplex lake. The data in these assets can be structured or unstructured, and can be stored in an analytical data lake or a data warehouse. In the diagram, there are data lakes for the sales domain, the supply chain domain, and the product domain.
  • Dataplex zones enable the data domain team to further organize data assets into smaller subgroups within the same Dataplex lake and add structures that capture key aspects of the subgroup. For example, Dataplex zones can be used to group associated data assets in a data product. Grouping data assets into a single Dataplex zone allows data domain teams to manage access policies and data governance policies consistently across the zone as a single data product. In the diagram, there are data zones for offline sales, online sales, supply chain warehouses, and products.

Dataplex lakes and zones enable an organization to unify distributed data and organize it based on the business context. This arrangement forms the foundation for activities such as managing metadata, setting up governance policies, and monitoring data quality. Such activities allow the organization to manage its distributed data at scale, such as in a data mesh.
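
To make this arrangement repeatable, the data platform team can provide templates such as the following minimal sketch, which creates a lake and a zone for the sales domain and registers a domain-owned bucket as an asset. The project, location, and resource names are illustrative assumptions:

# Minimal sketch of a Dataplex lake, zone, and asset for a sales domain.
resource "google_dataplex_lake" "sales" {
  name     = "sales-domain"
  project  = "sales-domain-project"
  location = "europe-west1"
}

resource "google_dataplex_zone" "online_sales" {
  name     = "online-sales"
  project  = "sales-domain-project"
  location = "europe-west1"
  lake     = google_dataplex_lake.sales.name
  type     = "RAW"

  discovery_spec {
    enabled = true
  }

  resource_spec {
    location_type = "SINGLE_REGION"
  }
}

# Register an existing Cloud Storage bucket as an asset in the zone.
resource "google_dataplex_asset" "online_sales_bucket" {
  name          = "online-sales-raw"
  project       = "sales-domain-project"
  location      = "europe-west1"
  lake          = google_dataplex_lake.sales.name
  dataplex_zone = google_dataplex_zone.online_sales.name

  discovery_spec {
    enabled = true
  }

  resource_spec {
    name = "projects/sales-domain-project/buckets/online-sales-raw-data"
    type = "STORAGE_BUCKET"
  }
}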

Data observability

Each data domain should implement its own monitoring and alerting mechanisms, ideally using a standardized approach. Each domain can apply the monitoring practices described in Concepts in service monitoring, making the necessary adjustments for their data domains. Observability is a large topic and is outside the scope of this document. This section addresses only patterns that are useful in data mesh implementations.

For data products with multiple data consumers, providing timely information to each consumer about the status of the product can become an operational burden. Basic solutions, such as manually managed email distribution lists, are typically error prone. They can be helpful for notifying consumers of planned outages, upcoming product launches, and deprecations, but they don't provide real-time operational awareness.

Central services can play an important role in monitoring the health and quality of the products in the data mesh. Although not a prerequisite for a successful implementation of the data mesh, implementing observability features can improve satisfaction of the data producers and consumers, and reduce overall operational and support costs. The following diagram shows an architecture of data mesh observability based on Cloud Monitoring.

Data mesh observability.

The following sections describe the components shown in the diagram: uptime checks, custom metrics, operational support by the central data platform team, and product scorecards.

Uptime checks

Data producer teams can create simple custom applications that implement uptime checks for their data products. These checks can serve as high-level indicators of the overall state of a product. For example, if the data producer team discovers a sudden drop in the data quality of its product, the team can mark that product as unhealthy. Uptime checks that are close to real time are especially important to data consumers who have derived products that rely on the constant availability of the data in the upstream data product. Data producers should build their uptime checks to include checks on their upstream dependencies, so that they provide an accurate picture of the health of their product to their data consumers.
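
For example, a health endpoint exposed by such an application can be monitored with a Cloud Monitoring uptime check. The following minimal sketch assumes a hypothetical HTTPS health endpoint for a sales data product; the host, path, and check frequency are illustrative:

# Minimal sketch of an uptime check against a data product's health endpoint.
resource "google_monitoring_uptime_check_config" "sales_orders_health" {
  project      = "central-monitoring-project"        # hypothetical central project
  display_name = "sales-orders-data-product"
  timeout      = "10s"
  period       = "60s"

  http_check {
    path    = "/health"
    port    = 443
    use_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = "sales-domain-project"
      host       = "sales-orders.data.example.com"   # hypothetical health endpoint host
    }
  }
}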

Data consumers can include product uptime checks in their processing. For example, a Cloud Composer job that generates a report based on the data provided by a data product can, as a first step, validate whether the product is in the "running" state. We recommend that your uptime check application return a structured payload in the message body of its HTTP response. This structured payload should indicate whether there's a problem, the root cause of the problem in human-readable form, and, if possible, the estimated time to restore the service. The structured payload can also provide more fine-grained information about the state of the product. For example, it can contain the health information for each of the views in the authorized dataset that's exposed as a product.

Custom metrics

Data products can have various custom metrics to measure their usefulness. Data producer teams can publish these custom metrics to their designated domain-specific Google Cloud projects. To create a unified monitoring experience across all data products, a central data mesh monitoring project can be given access to those domain-specific projects.

Each type of data product consumption interface has different metrics to measure its usefulness. Metrics can also be specific to the business domain. For example, the metrics for BigQuery tables exposed through views or through the Storage Read API can be as follows:

  • The number of rows.
  • Data freshness (expressed as the number of seconds between the last data update and the measurement time).
  • The data quality score.
  • Data availability. This metric indicates whether the data is available for querying. An alternative is to use the uptime checks mentioned earlier in this document.

These metrics can be viewed as service level indicators (SLIs) for a particular product.

For data streams (implemented as Pub/Sub topics), these metrics can be the standard Pub/Sub metrics that are available for topics.
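
For metrics that aren't provided by a Google Cloud service, data producer teams can register their own metric descriptors in their domain projects. The following minimal sketch defines a freshness metric; the metric type and label are hypothetical names, not a standard Cloud Monitoring metric:

# Minimal sketch of a custom metric descriptor for data product freshness.
resource "google_monitoring_metric_descriptor" "data_freshness" {
  project      = "sales-domain-project"
  type         = "custom.googleapis.com/data_product/freshness_seconds"   # hypothetical name
  metric_kind  = "GAUGE"
  value_type   = "INT64"
  unit         = "s"
  display_name = "Data product freshness"
  description  = "Seconds elapsed since the data product was last refreshed."

  labels {
    key         = "data_product"
    value_type  = "STRING"
    description = "Name of the data product."
  }
}

The producer's pipelines then write points to this metric, and the central monitoring project can read it across domains, for example through a metrics scope.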

Operational support by the central data platform team

The central data platform team can expose custom dashboards that display different levels of detail to data consumers. A simple status dashboard that lists the products in the data mesh and the uptime status for those products can help answer multiple end-user requests.

The central team can also serve as a notification distribution hub to notify data consumers about various events in the data products they use. Typically, this hub is implemented by creating alerting policies. Centralizing this function can reduce the work that each data producer team must do. Creating these policies doesn't require knowledge of the data domains, and it should help avoid bottlenecks in data consumption.

An ideal end state for data mesh monitoring is for the data product tag template to expose the SLIs and service-level objectives (SLOs) that the product supports when the product becomes available. The central team can then automatically deploy the corresponding alerting using service monitoring with the Monitoring API.
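
For example, the central team might deploy an alerting policy such as the following minimal sketch, which reuses the hypothetical freshness metric from the previous section and notifies consumers through a shared notification channel; the threshold, channel ID, and metric name are illustrative:

# Minimal sketch of a centrally deployed alerting policy for a data product SLI.
resource "google_monitoring_alert_policy" "sales_orders_freshness" {
  project      = "central-monitoring-project"
  display_name = "sales-orders freshness above objective"
  combiner     = "OR"

  conditions {
    display_name = "Freshness above 3600 s"

    condition_threshold {
      filter          = "metric.type=\"custom.googleapis.com/data_product/freshness_seconds\" AND metric.label.data_product=\"sales-orders\""
      comparison      = "COMPARISON_GT"
      threshold_value = 3600
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MAX"
      }
    }
  }

  # Hypothetical shared channel that data consumers subscribe to.
  notification_channels = [
    "projects/central-monitoring-project/notificationChannels/1234567890",
  ]
}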

Product scorecards

As part of the central governance agreement, the four functions in a data mesh can define the criteria to create scorecards for data products. These scorecards can become an objective measurement of data product performance.

Many of the variables used to calculate scorecards are based on the percentage of time that data products meet their SLOs. Useful criteria can include the percentage of uptime, average data quality scores, and the percentage of products whose data freshness stays within a defined threshold. To calculate these metrics automatically using Monitoring Query Language (MQL), the custom metrics and the results of the uptime checks in the central monitoring project should be sufficient.

What's next