Build a modern, distributed Data Mesh with Google Cloud
Product Management, Google
Senior Product Manager
Data drives innovation, but business needs are changing more rapidly than processes can accommodate, resulting in a widening gap between data and value. Your organization has many data sources that you use to make decisions, but how easy is it to access these new data sources? Do you trust the reports generated from these data sources? Who are the owners and producers of these data sources? Is there a centralized team who should be responsible for producing and serving every single data source in your organization? Or is it time to decentralize some data ownership and speed up data production? In other words, is it time to let the teams with the most context around data own it?
From a technology perspective, data platforms support these ambitions already. In the past, you were concerned about whether you had enough capacity or the amount of engineering time needed to incorporate new data sources into your analytics stack. The data processing, network, and storage barriers are now coming down, and you can ingest, store, process, and access much more data residing in different source systems without costing a fortune.
But here’s the thing. Even though data platforms have evolved, the organizational model for generating analytics data and processes users follow to access and use it haven't. Many organizations rely on a central team to create a repository of all the data assets in the organization, and then make them useful and accessible to the users of that data. This slows companies down from getting the value they want from their data. When we talk to our customers we see one of two problems:
The first problem is a data bottleneck. There’s only one team, sometimes just one person or system that can access the data, so every request for data must go through them. The central team is also asked to interpret the use cases for that data, and make judgments on the data assets required without having much domain knowledge about the data. This situation causes a lot of frustration for data analysts, data scientists and ultimately any business user who requires data for decision making. Over time, people give up on waiting and make decisions without data.
Data chaos is the other thing that happens, because people get fed up with the bottleneck. People copy the most relevant data they can find, not knowing if it is the best option available to them. This data duplication (and subsequent uses) can happen enough times that users lose track of the source of truth of the data, its freshness, and what the data means. Aside from being a data governance nightmare, this creates unnecessary work and a waste of system resources, leading to increased complexity and cost. It slows everyone down and erodes trust in data.
To address the above challenges, organizations may wish to give business domains autonomy in generating, analyzing, and exposing data as data products, as long as these data products have a justifiable use case. The same business domains would own their data products throughout their entire lifecycle.
In this model, the need for a central data team remains, although without ownership of the data itself. The goal of the central team is to support users in generating value from data by enabling them to autonomously build, share, and use data products. The central team does this via a set of standards and best practices for domains to build, deploy, and maintain data products that are secure and interoperable, governance policies to build trust in these products (and the tooling to assist domains to adhere to them), and a common platform to enable self-serve discovery and use of data products by domains. Their job is made easier by an already self-service and serverless data platform.
In 2019, Zhamak Dehghani introduced to the world the notion of Data Mesh, applying a DevOps mentality that was developed through infrastructure modernization to data. Coincidentally, this is how Google has been operating internally over the last decade. A decentralized data platform is achieved by using BigQuery behind the scenes. As a result, instead of moving data from domains into a centrally owned data lake or platform, domains can host and serve their domain datasets in an easily consumable way. The business area generating data becomes responsible for owning and serving their datasets for access by teams with a business need for that data. We have been working with numerous customers over the last two years who are eager to try Data Mesh out for themselves.
We have written about how to build a data mesh on Google Cloud in detail: you can read the full whitepaper here, and a follow up guide to implementation here. In a nutshell, Data Mesh is an architectural paradigm that decentralizes data ownership into the teams that have the greatest business context about that data. These teams take on the responsibility of keeping data fresh, trustworthy, and discoverable by data consumers elsewhere in the company. Data effectively becomes a product, owned and managed within a domain by the teams who know it best. For this approach to work, governance also needs to be federated across the domains, so that management of data and access can be customized, within boundaries, by the data owners as well.
The idea of a Data Mesh is alluring; it combines business needs with technology in a way we don’t typically see. It promises a solution to help break down organizational barriers in extracting value from data. To do this, companies must adopt four principles of Discoverability, Accessibility, Ownership, and (Federated) Governance, which require a coordinated effort across technical and business unit leadership. In practice, each group that owns a data domain across a decentralized organization may need to employ a hybrid group of data workers to take on the increased data curation, data management, data engineering, and data governance tasks required to own and maintain data products for that domain. From day-to-day operations of the team to employee management and performance evaluations, this significantly impacts an organization, so it is not a small change to make and needs buy-in from cross-functional stakeholders and leadership across the company.
It is essential that the offices of the Chief Information Security Officer (CISO), Chief Data Officer (CDO), and Chief Information Officer (CIO) are engaged as the key stakeholders as early as possible to enable business units to manage data products in addition to their business-as-usual activities. There must also be business unit leaders willing to have their teams assume this new responsibility. If key stakeholders are less involved in your organizational planning, this may result in inadequate resources being allocated and the overall project failing. Fundamentally, Data Mesh is not just a technical architecture but rather an operating model shift towards distributed ownership of data and autonomous use of technology to enable business units to optimize locally for agility. Thinh Ha’s article on organizational features that are anti-candidates for Data Mesh is a must-read if you are considering this approach at your company.
At Google Cloud, we have built managed services to help companies like Delivery Hero modernize their analytics stack and implement Data Mesh practices.
Data Mesh promises domain-oriented, decentralized data ownership and architecture where each domain is responsible for creating and consuming data - which in turn allows faster scaling of the number of data sources and use cases. You can achieve this by having federated computation and access layers while keeping your data in BigQuery and BigLake. Then you can join data from different domains, even raw data if needed, with no duplication or data movement. Analytics Hub is then used for discovery together with Dataplex. In addition, Dataplex provides the ability to handle centralized administration and governance. This is further complemented by having Looker, which fits in perfectly as it allows scientists, analysts, and even business users to access their data with a single semantic model. This universal semantic layer abstracts data consumption for business users and harmonizes data access permissions.
In addition, BigQuery StorageApi allows data access from 20+ APIs at high performance without impacting other workloads, due to its true separation of storage and computation. In this way, BigQuery acts as a lakehouse, bringing together data warehouse and data lake functionality, and allowing different types and higher volumes of data. You can read more about lake houses in our recent open data lakehouse article. Through powerful federated queries, BigQuery can also process external data sources in object storage (Cloud Storage) for Parquet and ORC open source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive. All this can be done without moving the data. These components come together in the example architecture as outlined in Figure 2.
If you think a Data Mesh may be the right approach for your business, we encourage you to read our whitepaper on how to build a Data Mesh on Google Cloud. But before you consider building a Data Mesh, think through your company’s data philosophy and whether you’re organizationally ready for such a change. We recommend you read “What type of data processing organization are you?” to get started on this journey. The answer to this question is critical as Data Mesh may not be suitable for your organization as a whole if you have existing processes that cannot be federated immediately. If you are looking to modernize your analytics stack as a whole, check out our white paper on building a unified analytics platform. For more interesting insights on Data Mesh, you can read the full whitepaper here. We also have a guide to implementing a data mesh on Google Cloud here, that details the architecture possible on Google Cloud, the key functions and roles required to deliver on that architecture, and the considerations that should be taken into account in each major task in the data mesh.
It was an honor and privilege to work on this with Diptiman Raichaudhuri, Sergei Lilichenko, Shirley Cohen, Thinh Ha, Yu-lm Loh, Johan Pcikard, Yu-lm Loh and Maxime Lanciaux for support, work they have done and discussions.