Data mesh is an architectural framework for managing data in complex organizations. Unlike centralized models, data mesh decentralizes data ownership to domain-specific teams. This approach can help to eliminate bottlenecks by treating data as a product, but it also introduces new resource requirements. Success with data mesh depends on domain teams possessing specific data engineering skills and governance capabilities. For organizations that have the resources to support distributed teams, data mesh can improve agility. For others, centralized models like data warehouses or data lakes may remain a more efficient solution.
Data mesh isn't just about a new set of tools or technologies; it's a shift in how companies think about their data. There are four core principles that guide the data mesh approach. These principles are what make the approach so effective at solving the problems of traditional, centralized data architectures.
In a traditional data architecture, a single central team, like an IT or data engineering team, is responsible for all data. In a data mesh, data ownership is spread out to the business domains that create the data. For example, a sales team would own the customer data they generate, and a marketing team would own the campaign data they create. This makes teams more responsible and accountable for the data they produce.
With domain-oriented ownership, the teams that create data must also treat it like a product. Just as a company would provide a high-quality product to a customer, a data domain team needs to provide high-quality data to other teams that need it. This means the data is easy to discover, understand, and use. It also has to be trustworthy, secure, and well documented, with built-in access controls so that only the right people can access the data intended for their use case.
To make treating data as a product possible, a data mesh uses a self-serve platform. This platform is a set of tools and services that allows data domain teams to easily create and manage their data products without needing help from a central data team. It should be simple and easy to use, automating many of the technical tasks involved in data management, such as data storage, security, and governance.
Since data is decentralized and spread across many different teams, there needs to be a way to ensure everyone follows the same rules. This is where federated computational governance comes in. It's a model where a small, central team sets the global rules and standards for all data. However, the enforcement of these rules is handled by the data domain teams themselves. This combines the best of both worlds: centralized policies with decentralized execution.
A data product in a data mesh should be findable, addressable, trustworthy, self-describing, and secure. It should be easy for data consumers to discover the data, understand what it is, and know that it's high quality. It should also have clear and consistent access rules to ensure security.
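To make those properties concrete, here is a hypothetical sketch in Python of the metadata a domain team might publish alongside a data product so consumers can find, understand, and trust it. The fields and values are illustrative, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class DataProductDescriptor:
    """Hypothetical metadata a domain team publishes with each data product."""
    name: str          # findable: listed in the data catalog
    address: str       # addressable: a stable URI or table path
    owner: str         # accountable domain team
    description: str   # self-describing: what the data means
    schema: dict = field(default_factory=dict)  # column names and types
    freshness_sla: str = "24h"                  # trustworthy: how current it is
    access_policy: str = "restricted"           # secure: who may read it

# Example descriptor for a hypothetical sales-domain product.
product = DataProductDescriptor(
    name="customer_orders",
    address="bq://sales.orders_product",
    owner="sales-domain-team@example.com",
    description="One row per confirmed customer order",
    schema={"order_id": "STRING", "amount": "FLOAT64"},
)
```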
Starting a data mesh is an incremental process. It's often best to start with a small pilot project and a few willing domain teams. Begin by identifying a business domain that can benefit from greater data autonomy. Then, create a minimal self-serve platform that allows that team to create a data product. As the project succeeds, you can use the results as a proof of concept to get the broader organization on board with the data mesh architecture.
One of the biggest challenges is the cultural shift. It can be difficult for a centralized data team to give up control. There are also technical challenges, such as ensuring data security and managing a distributed system. However, with careful planning and a clear communication strategy, these challenges can be overcome.
Data mesh is designed to work with existing data systems. It doesn't require you to throw out your current data lakes or data warehouses. Instead, it can be implemented on top of them. A data mesh can act as a new layer that provides a unified, self-serve way for teams to access data from different sources.
A common misconception is that data mesh is a product you can buy. It's not. It's a new way of organizing and managing data. Another myth is that it's only for large enterprises. While it's most common in large companies, the principles can be applied to smaller organizations as well.
Measuring the success of a data mesh can be tricky because the benefits are often not financial at first. Instead, you can measure success by looking at things like the speed of data delivery, the number of teams using the data platform, and the trust that teams have in the data they are consuming. Over time, these improvements can lead to better business outcomes and a higher return on investment (ROI).
The data mesh approach was created to solve some of the common problems with traditional data architectures. These models, such as data warehouses or data lakes owned by individual departments or teams, can create data silos and governance risks, especially as a company grows. Data mesh tackles these issues by distributing ownership and empowering individual teams while still maintaining central controls for governing and monitoring the data across domains.
| Feature | Data mesh | Traditional architectures |
| --- | --- | --- |
| Architectural model | Decentralized and distributed across business domains. | Centralized and monolithic, managed by a single team. |
| Data ownership | Data is owned by the domain teams that create and use it. | Data is owned and managed by a central data team. |
| Data access | Teams access data through standardized data products. | Teams must go through a central team to get data. |
| Scalability | Can scale easily as new domain teams and data products are added. | Can become a bottleneck as the organization and data volume grow. |
| Data quality | Domain teams are accountable for their own data quality, which can increase trust and accuracy. | Data quality can be inconsistent, as the central team may lack the context of each domain. |
| Data governance | Governance is federated, with global standards and rules set centrally but enforced by domain teams. | Governance is centralized and handled entirely by one team. |
| Use case | Can be best for large, complex organizations with diverse data and independent business units. | Can be best for smaller organizations or for specific use cases that require a single source of truth. |
| Technical expertise / resources needed | Requires distributed technical skills (engineering, governance) within each domain team. | Centralizes technical expertise in one core IT or data engineering team. |
The data mesh approach can be particularly useful for large, complex organizations that have multiple business units and a large amount of data. Here are a few common use cases where a data mesh can provide significant value.
A data mesh can help an organization get more value from its data analytics and business intelligence (BI) initiatives. With data products from different domains, data scientists and analysts can get a more complete view of the business. For example, a retail company can combine customer data from its sales domain with web traffic data from its marketing domain to better understand customer behavior.
A customer 360 initiative aims to create a complete view of a customer by combining data from different sources. This can be challenging in a centralized data architecture because data is often siloed in different departments. A data mesh makes this much easier by providing a standardized way to access and combine data products from different domains, such as sales, marketing, and support.
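As an illustration, a customer 360 query might join the data products published by three domains in place. This sketch uses the google-cloud-bigquery client library; the table and column names are hypothetical.

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical data products published by the sales, marketing,
# and support domains, joined on a shared customer_id key.
query = """
SELECT
  s.customer_id,
  s.lifetime_value,
  m.last_campaign,
  COUNT(t.ticket_id) AS open_tickets
FROM `sales.customers_product` AS s
LEFT JOIN `marketing.campaign_touches_product` AS m USING (customer_id)
LEFT JOIN `support.tickets_product` AS t USING (customer_id)
GROUP BY s.customer_id, s.lifetime_value, m.last_campaign
"""

for row in client.query(query).result():
    print(row.customer_id, row.lifetime_value, row.open_tickets)
```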
In financial services, a data mesh can be used for real-time monitoring and fraud detection. A bank, for instance, could have a data product for transactions and another for customer login data. A fraud detection system can then access both data products to identify suspicious activity. The decentralized nature of a data mesh can help with the speed and reliability needed for these kinds of applications.
As data privacy regulations become more complex, it can be difficult to ensure compliance in a centralized data model. A data mesh can help with regulatory compliance by allowing domain teams to manage their own data products and ensure they are compliant with local laws. This is particularly important for multinational companies that need to adhere to different data sovereignty rules in different countries.
Advanced AI applications and agents require high-quality, context-rich data to function effectively. In a data mesh, domain teams curate data specifically for consumption, ensuring that it is clean, labeled, and documented. This allows data scientists to train models on reliable inputs without spending excessive time on data preparation. Furthermore, AI agents can access these modular data products via APIs to retrieve real-time information, enabling them to perform complex tasks across different business domains with greater accuracy.
Adopting a data mesh can provide significant benefits for an organization. By moving to a decentralized model, companies can overcome the bottlenecks of traditional architectures and achieve better business outcomes.
Agility and scalability
A data mesh can make an organization more agile. Because each data domain works independently, the organization can scale and evolve more quickly, adding new data products and services without causing disruptions.
Data quality and trust
A data mesh can assign accountability to the domain teams that produce the data. Since the domain teams are also the primary consumers of their own data, they have a strong incentive to ensure its quality. This can lead to more trustworthy data.
Cost efficiency
A data mesh can also help a company become more cost efficient. With a centralized data platform, teams often have to wait for a central data team to help them with their data needs, which leads to delays and wasted resources. By letting domain teams serve themselves, a data mesh reduces that idle time and frees the central team to focus on the platform itself.
Dataplex Universal Catalog acts as a unified data fabric and provides a central governance layer over your data mesh. It can help you discover, manage, and govern your distributed data across various environments, ensuring you have a single source of truth for metadata and policies. To get started, you'll need to create a Dataplex lake. A Dataplex lake is a top-level container that holds your data and is typically mapped to a business domain.
Here are the steps to create a lake:
1. In the Google Cloud console, open Dataplex and create a lake, typically one per business domain.
2. Within the lake, add zones to organize data by stage, such as a raw zone for unprocessed data and a curated zone for cleaned, structured data.
3. Attach assets, such as Cloud Storage buckets and BigQuery datasets, to those zones.
Dataplex then automatically scans these assets to discover and catalog metadata.
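If you prefer to script the first step, here is a minimal sketch using the google-cloud-dataplex Python client. The project, region, and lake names are placeholders.

```python
# pip install google-cloud-dataplex
from google.cloud import dataplex_v1

PROJECT_ID = "my-project"   # placeholder project
REGION = "us-central1"      # placeholder region

client = dataplex_v1.DataplexServiceClient()

# A lake is the top-level Dataplex container, typically one per business domain.
lake = dataplex_v1.Lake(
    display_name="Sales domain",
    description="Data products owned by the sales domain team",
)

operation = client.create_lake(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}",
    lake_id="sales-domain",
    lake=lake,
)

# create_lake is a long-running operation; block until it completes.
created = operation.result()
print(f"Created lake: {created.name}")
```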
A key part of the "data as a product" principle is making data easily discoverable. BigQuery data sharing allows you to build a data product marketplace. This lets domain teams securely share data products with other teams without copying or moving the data. It can help data consumers find the data they need and provides them with a clear, well-defined interface to access it.
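As one simple illustration, a domain team can grant another team read access to the BigQuery dataset behind a data product, so consumers query it in place rather than copying it. This sketch uses the google-cloud-bigquery client; the project, dataset, and group names are hypothetical.

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client(project="sales-domain-project")  # hypothetical project

# Fetch the dataset that backs the data product.
dataset = client.get_dataset("sales_customer_product")  # hypothetical dataset

# Grant the marketing team read access -- they query in place, no copies.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="marketing-team@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```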
Google Cloud's serverless services empower domain teams to create and manage their own data products with minimal overhead. BigQuery is a powerful, serverless data warehouse that allows teams to analyze large datasets quickly and efficiently. Dataflow is a serverless data processing service that can be used to build and automate data pipelines for data products. These services reduce the need for a central data engineering team to manage infrastructure, making domain teams more autonomous and agile.
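As a sketch of what a domain-owned pipeline might look like, here is a small Apache Beam job that reads raw events from BigQuery, filters out invalid rows, and writes a cleaned table that serves as the data product. All project, table, and bucket names are placeholders, and running it on Dataflow assumes the referenced resources exist.

```python
# pip install "apache-beam[gcp]"
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and storage locations.
options = PipelineOptions(
    runner="DataflowRunner",  # execute serverlessly on Dataflow
    project="sales-domain-project",
    region="us-central1",
    temp_location="gs://sales-domain-temp/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadRawOrders" >> beam.io.ReadFromBigQuery(
            query="SELECT order_id, customer_id, amount "
                  "FROM `sales.raw_orders`",
            use_standard_sql=True,
        )
        # Drop rows that fail a basic quality check before publishing.
        | "DropInvalid" >> beam.Filter(lambda row: row["amount"] is not None)
        | "WriteProduct" >> beam.io.WriteToBigQuery(
            "sales-domain-project:sales.orders_product",
            schema="order_id:STRING,customer_id:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```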
Federated computational governance is the principle of having a central team define global rules, but allowing domain teams to enforce them. Google Cloud's Identity and Access Management (IAM) conditions provide the tools to implement this. IAM conditions allow for attribute-based access control (ABAC), where you can set up fine-grained permissions based on data attributes. For example, you can create a policy that only allows a user to access customer data from their specific region, helping ensure compliance with data sovereignty regulations like GDPR.
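For example, the following sketch uses the Cloud Storage Python client to attach an IAM condition that limits a group to objects under an eu/ prefix. The bucket, prefix, and group are hypothetical, and a real data sovereignty setup would layer additional controls on top.

```python
# pip install google-cloud-storage
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("sales-domain-data")  # hypothetical bucket

# IAM conditions require policy version 3.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.version = 3

# Grant read access only to objects whose names start with the eu/ prefix.
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"group:eu-analysts@example.com"},
    "condition": {
        "title": "eu-data-only",
        "description": "Limit access to objects under the eu/ prefix",
        "expression": (
            'resource.name.startsWith('
            '"projects/_/buckets/sales-domain-data/objects/eu/")'
        ),
    },
})

bucket.set_iam_policy(policy)
```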