You can use Dataplex to build a data mesh architecture. This guide shows you how to use Dataplex features, such as a lake, zones, and assets, to build a data mesh.
A data mesh is an organizational and technical approach that decentralizes data ownership among domain data owners. These owners provide the data as a product in a standard way and facilitate communication among different parts of the organization to distribute datasets across different locations. Learn more about data mesh architectures.
Objectives
In this guide, you use the following Dataplex entities to build a data mesh architecture:
- Create a Dataplex lake that will act as the domain for your data mesh.
- Add zones to your lake that will represent individual teams within each domain and provide managed data contracts.
- Attach assets that map to data stored in Cloud Storage.
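The hierarchy described above (a lake that contains zones, which in turn contain assets) can be pictured with a small data model. The following is only an illustrative sketch: the class names and the bucket name `example-bucket` are hypothetical and are not part of any Dataplex API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    display_name: str
    bucket: str  # hypothetical Cloud Storage bucket name


@dataclass
class Zone:
    display_name: str
    zone_type: str  # "RAW" or "CURATED"
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Lake:
    display_name: str
    region: str
    zones: List[Zone] = field(default_factory=list)


def describe(lake: Lake) -> List[str]:
    """Return an indented outline of the lake, its zones, and their assets."""
    lines = [f"lake: {lake.display_name} ({lake.region})"]
    for zone in lake.zones:
        lines.append(f"  zone: {zone.display_name} [{zone.zone_type}]")
        for asset in zone.assets:
            lines.append(f"    asset: {asset.display_name} -> gs://{asset.bucket}")
    return lines


# The display names match the ones used later in this guide.
mesh = Lake(
    display_name="My data mesh",
    region="us-central1",
    zones=[
        Zone("My sub domain", "RAW", [Asset("Data mesh asset", "example-bucket")]),
    ],
)
print("\n".join(describe(mesh)))
```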
Costs
This tutorial uses the following billable components of Google Cloud:
- Dataplex
- Cloud Storage
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.
Before you begin
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
- Enable the Dataplex API.
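If you prefer a terminal to the console, the project selection and API enablement can be scripted. The following minimal sketch only builds and prints the `gcloud` command lines; it does not run them. It assumes that the gcloud CLI is installed and authenticated, and `my-project-id` is a hypothetical project ID.

```python
import shlex
from typing import List

DATAPLEX_SERVICE = "dataplex.googleapis.com"


def prerequisite_commands(project_id: str) -> List[List[str]]:
    """Build the commands that select a project and enable the Dataplex API."""
    if not project_id:
        raise ValueError("project_id must not be empty")
    return [
        ["gcloud", "config", "set", "project", project_id],
        ["gcloud", "services", "enable", DATAPLEX_SERVICE],
    ]


# "my-project-id" is a placeholder, not a real project.
for command in prerequisite_commands("my-project-id"):
    print(shlex.join(command))
```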
Create a Cloud Storage bucket
You need a Cloud Storage bucket to store the data assets of your data mesh.
Follow the steps to create a Cloud Storage bucket, and use the following settings:
- Name your bucket.
- Under Location type, choose Region.
- Under Location, choose us-central1 (Iowa).
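The bucket settings above can also be expressed as a `gcloud storage buckets create` command. This sketch builds the argument list without running it; the bucket name `example-data-mesh-bucket` is a hypothetical placeholder, and you may want to check the flags against the current gcloud reference before use.

```python
import shlex
from typing import List


def bucket_create_command(bucket_name: str, region: str = "us-central1") -> List[str]:
    """Build a command that creates a Cloud Storage bucket in a single region."""
    if not bucket_name:
        raise ValueError("bucket_name must not be empty")
    return [
        "gcloud", "storage", "buckets", "create",
        f"gs://{bucket_name}",
        f"--location={region}",
    ]


# Hypothetical bucket name; bucket names must be globally unique.
print(shlex.join(bucket_create_command("example-data-mesh-bucket")))
```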
Create a domain
In the Google Cloud console, go to the Dataplex page.
Navigate to the Manage view.
Click Create to create a new lake, which acts as the domain for your data mesh.
Under Display name, enter My data mesh.
Under Region, select us-central1.
If you previously created and configured a Dataproc Metastore service, select it as the associated metastore.
Click Create.
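For a scripted equivalent of the lake creation steps, the sketch below assembles a `gcloud dataplex lakes create` command. It only builds the argument list. The lake ID `my-data-mesh` is a hypothetical choice (the console generates an ID for you), and the optional metastore argument is an assumption about how you might pass a Dataproc Metastore service, so verify the flag names against the gcloud reference.

```python
import shlex
from typing import List, Optional


def lake_create_command(
    lake_id: str,
    display_name: str,
    region: str,
    metastore_service: Optional[str] = None,
) -> List[str]:
    """Build a command that creates a Dataplex lake."""
    command = [
        "gcloud", "dataplex", "lakes", "create", lake_id,
        f"--location={region}",
        f"--display-name={display_name}",
    ]
    if metastore_service:
        # Assumed flag for attaching an existing Dataproc Metastore service.
        command.append(f"--metastore-service={metastore_service}")
    return command


# "my-data-mesh" is a hypothetical lake ID.
print(shlex.join(lake_create_command("my-data-mesh", "My data mesh", "us-central1")))
```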
Create zones in your lake
After you create a domain by creating a Dataplex lake, you can use zones to represent individual teams within the domain and to host managed data contracts. There are two types of zones:
Raw zones are typically used to store data in any format from external sources in Cloud Storage and are useful for data that requires further processing before it is ready for consumption.
Curated zones are used for structured data in Cloud Storage that must conform to certain file formats and be organized in a Hive-compatible directory layout. They are most useful for data that's ready for consumption and analysis.
Each domain (for example, sales, customers, products) should have at least a raw zone and a curated zone.
Additional zones are used to manage data contracts between teams or to provide a more granular breakdown for teams within a given domain (for example, inventory management within the product domain). Data owners can manage and access the data within their domain.
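A Hive-compatible directory layout, as expected in curated zones, encodes partition values as `key=value` directory names. The following sketch shows one way to build such an object path; the table name, partition keys, and file name are hypothetical examples, and the exact layout rules that Dataplex enforces are not covered here.

```python
from typing import Dict


def hive_partition_path(table: str, partitions: Dict[str, str], filename: str) -> str:
    """Return an object path that uses Hive-style key=value partition directories."""
    parts = [table]
    for key, value in partitions.items():
        # Reject characters that would break the key=value directory convention.
        if "/" in key or "=" in key or "/" in value:
            raise ValueError(f"invalid partition component: {key}={value}")
        parts.append(f"{key}={value}")
    parts.append(filename)
    return "/".join(parts)


# Prints: orders/year=2024/month=01/part-0000.parquet
print(hive_partition_path("orders", {"year": "2024", "month": "01"}, "part-0000.parquet"))
```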
In Dataplex in the Google Cloud console, navigate to the Manage view.
Click the name of the lake (My data mesh) that you want to add a zone to.
In the Zones tab, click Add Zone.
Under Display name, enter My sub domain. Dataplex automatically generates an ID for your zone.
NOTE: The zone name will become the name of a BigQuery dataset. Therefore, all zones hosted in the same Google Cloud project must have a unique ID, even if they exist within different lakes.
Under Type, select Raw zone.
Click Create.
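The zone steps, together with the uniqueness requirement from the note, can be sketched as a command builder that refuses a zone ID already used elsewhere in the project. The IDs `my-sub-domain` and `my-data-mesh` are hypothetical, the list of existing IDs is something you would supply yourself, and the flags should be verified against the gcloud reference.

```python
import shlex
from typing import Iterable, List

VALID_ZONE_TYPES = {"RAW", "CURATED"}


def zone_create_command(
    zone_id: str,
    lake_id: str,
    region: str,
    zone_type: str,
    existing_zone_ids: Iterable[str] = (),
) -> List[str]:
    """Build a command that creates a zone, rejecting IDs already used in the project."""
    if zone_type not in VALID_ZONE_TYPES:
        raise ValueError(f"zone_type must be one of {sorted(VALID_ZONE_TYPES)}")
    if zone_id in set(existing_zone_ids):
        # Zone IDs must be unique per project, even across different lakes.
        raise ValueError(f"zone ID already in use in this project: {zone_id}")
    return [
        "gcloud", "dataplex", "zones", "create", zone_id,
        f"--lake={lake_id}",
        f"--location={region}",
        f"--type={zone_type}",
        "--resource-location-type=SINGLE_REGION",
    ]


# Hypothetical IDs; the console would generate the zone ID for you.
print(shlex.join(zone_create_command("my-sub-domain", "my-data-mesh", "us-central1", "RAW")))
```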
Attach assets to your zones
Attach data assets to your zone. A data asset is a storage resource that contains your data, and it can be a Cloud Storage bucket or a BigQuery dataset. This is the final step in creating your data mesh architecture.
In the Dataplex Manage view, click the lake you created (My data mesh).
In the Zones tab, click the zone (My sub domain) to add the asset to.
In the Assets tab, click Add assets.
Click Add an asset.
Under Type, select Cloud Storage bucket.
In Display name, enter Data mesh asset. Dataplex automatically generates an asset ID for you.
Next to Bucket, click Browse.
- Select your bucket from the list.
- Click Select.
Click Done.
Click Continue.
Click Continue to accept the default Advanced settings.
Click Submit to add your Cloud Storage bucket as a data asset to your zone.
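Attaching a bucket can likewise be sketched as a `gcloud dataplex assets create` command. This example only builds the argument list. All IDs, the project name, and the bucket name are hypothetical, and the resource name format shown is an assumption to check against the Dataplex reference.

```python
import shlex
from typing import List


def asset_create_command(
    asset_id: str,
    zone_id: str,
    lake_id: str,
    region: str,
    project: str,
    bucket: str,
) -> List[str]:
    """Build a command that attaches a Cloud Storage bucket to a zone as an asset."""
    # Assumed resource name format for a Cloud Storage bucket.
    resource_name = f"projects/{project}/buckets/{bucket}"
    return [
        "gcloud", "dataplex", "assets", "create", asset_id,
        f"--lake={lake_id}",
        f"--zone={zone_id}",
        f"--location={region}",
        "--resource-type=STORAGE_BUCKET",
        f"--resource-name={resource_name}",
    ]


# Every identifier below is a placeholder.
print(shlex.join(asset_create_command(
    "data-mesh-asset", "my-sub-domain", "my-data-mesh",
    "us-central1", "my-project-id", "example-data-mesh-bucket",
)))
```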
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete your data mesh architecture
In Dataplex in the Google Cloud console, navigate to the Manage view.
Find the name of the Dataplex lake you want to remove.
Click the three-dot menu to the right of the lake that you want to delete, and then click Delete.
Type "delete" and click Delete lake to confirm.
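If you script the cleanup instead of using the console, one cautious approach is to remove child resources before their parents: assets first, then zones, then the lake. Whether this ordering is strictly required is an assumption here, so treat the following as a sketch. It only builds the `gcloud` delete commands in that order, and all IDs are hypothetical.

```python
import shlex
from typing import Dict, List


def teardown_commands(
    lake_id: str, region: str, zones: Dict[str, List[str]]
) -> List[List[str]]:
    """Build delete commands ordered as assets, then zones, then the lake."""
    asset_cmds: List[List[str]] = []
    zone_cmds: List[List[str]] = []
    for zone_id, asset_ids in zones.items():
        for asset_id in asset_ids:
            asset_cmds.append([
                "gcloud", "dataplex", "assets", "delete", asset_id,
                f"--zone={zone_id}", f"--lake={lake_id}", f"--location={region}",
            ])
        zone_cmds.append([
            "gcloud", "dataplex", "zones", "delete", zone_id,
            f"--lake={lake_id}", f"--location={region}",
        ])
    lake_cmd = ["gcloud", "dataplex", "lakes", "delete", lake_id, f"--location={region}"]
    return asset_cmds + zone_cmds + [lake_cmd]


# Maps each hypothetical zone ID to the asset IDs attached to it.
for command in teardown_commands(
    "my-data-mesh", "us-central1", {"my-sub-domain": ["data-mesh-asset"]}
):
    print(shlex.join(command))
```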
What's next
- Learn about data processing tasks
- Learn about discovering data
- Learn about using data quality tasks