You can use Dataplex to build a data mesh architecture. This guide shows you how to use Dataplex features, such as a lake, zones, and assets, to build a data mesh.
A data mesh is an organizational and technical approach that decentralizes data ownership among domain data owners. These owners provide their data as a product in a standard way, which facilitates communication among different parts of the organization and the distribution of datasets across different locations. Learn more about data mesh architectures.
Objectives
In this guide, you use the following Dataplex entities to build a data mesh architecture:
- Create a Dataplex lake that will act as the domain for your data mesh.
- Add zones to your lake that will represent individual teams within each domain and provide managed data contracts.
- Attach assets that map to data stored in Cloud Storage.
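The lake, zone, and asset entities nest inside one another, and the nesting shows up in the full resource names that the Dataplex API uses. The following is a minimal sketch of that hierarchy. The project ID and the lake, zone, and asset IDs are hypothetical placeholders chosen for illustration, not values from this guide.

```python
# Sketch: how lake, zone, and asset IDs nest in Dataplex resource names.
# All IDs passed to these helpers are hypothetical placeholders.

def lake_name(project: str, location: str, lake: str) -> str:
    """Full resource name of a lake (the domain)."""
    return f"projects/{project}/locations/{location}/lakes/{lake}"


def zone_name(project: str, location: str, lake: str, zone: str) -> str:
    """Full resource name of a zone inside a lake."""
    return f"{lake_name(project, location, lake)}/zones/{zone}"


def asset_name(project: str, location: str, lake: str, zone: str, asset: str) -> str:
    """Full resource name of an asset inside a zone."""
    return f"{zone_name(project, location, lake, zone)}/assets/{asset}"


if __name__ == "__main__":
    print(asset_name("my-project", "us-central1", "my-data-mesh",
                     "my-sub-domain", "data-mesh-asset"))
```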
Costs
In this document, you use billable components of Google Cloud, including Dataplex and Cloud Storage.
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Dataplex API.
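If you prefer to script the last step instead of using the console, one option is the gcloud CLI. The following sketch only builds the command as an argument list. It assumes that the gcloud CLI is installed and authenticated, and `my-project` is a hypothetical project ID.

```python
# Sketch: build the gcloud command that enables the Dataplex API.
# Assumes the gcloud CLI is installed; "my-project" is a placeholder.
import shlex
from typing import List

DATAPLEX_SERVICE = "dataplex.googleapis.com"


def enable_dataplex_command(project_id: str) -> List[str]:
    """Return the argv list for enabling the Dataplex API in a project."""
    return ["gcloud", "services", "enable", DATAPLEX_SERVICE,
            f"--project={project_id}"]


if __name__ == "__main__":
    # Print the command for review rather than running it.
    print(shlex.join(enable_dataplex_command("my-project")))
```

To run the command, you could pass the returned list to `subprocess.run(cmd, check=True)`.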
Create a Cloud Storage bucket
You need a Cloud Storage bucket to store the data assets of your data mesh.
Follow the steps to create a Cloud Storage bucket, and:
- Name your bucket.
- For Location type, choose Region, and select us-central1 (Iowa) from the drop-down menu.
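As an alternative to the console, the bucket can also be created with the gcloud CLI. The following sketch builds that command with the same region as above. The bucket name and project ID are hypothetical placeholders, and the sketch assumes that the gcloud CLI is installed and authenticated.

```python
# Sketch: build the gcloud command that creates a regional bucket.
# The bucket name and project ID are hypothetical placeholders.
import shlex
from typing import List


def create_bucket_command(bucket: str, project_id: str,
                          location: str = "us-central1") -> List[str]:
    """Return the argv list for creating a Cloud Storage bucket."""
    return ["gcloud", "storage", "buckets", "create", f"gs://{bucket}",
            f"--location={location}", f"--project={project_id}"]


if __name__ == "__main__":
    # Print the command for review rather than running it.
    print(shlex.join(create_bucket_command("my-data-mesh-bucket", "my-project")))
```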
Create a domain
In the Google Cloud console, go to the Dataplex page:
Navigate to the Manage view.
Click Create to create a new lake, which will act as the domain for your data mesh.
In the Display name field, enter My data mesh.
For Region, select us-central1.
For the associated metastore, select the Dataproc Metastore service that you previously created and configured.
Click Create.
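The console steps above can also be approximated with the gcloud CLI. The following sketch builds a `gcloud dataplex lakes create` command. The lake ID `my-data-mesh` and the metastore service name are hypothetical, since the console generates the ID for you, and the flag names should be checked against the gcloud reference for your installed version.

```python
# Sketch: build the gcloud command that creates a Dataplex lake.
# The lake ID and the metastore service name are hypothetical placeholders.
import shlex
from typing import List, Optional


def create_lake_command(lake_id: str, display_name: str,
                        location: str = "us-central1",
                        metastore_service: Optional[str] = None) -> List[str]:
    """Return the argv list for creating a lake, optionally with a metastore."""
    cmd = ["gcloud", "dataplex", "lakes", "create", lake_id,
           f"--location={location}", f"--display-name={display_name}"]
    if metastore_service:
        # Expected form: projects/PROJECT/locations/LOCATION/services/SERVICE
        cmd.append(f"--metastore-service={metastore_service}")
    return cmd


if __name__ == "__main__":
    # Print the command for review rather than running it.
    print(shlex.join(create_lake_command("my-data-mesh", "My data mesh")))
```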
Create zones in your lake
After creating a domain by creating a Dataplex lake, you can host managed data contracts and individual teams within the domain by using zones. There are two types of zones:
Raw zones are typically used to store data in any format from external sources in Cloud Storage. Raw zones are useful for data that requires further processing before it's ready for consumption.
Curated zones are used for structured data in Cloud Storage that must conform to certain file formats, and are organized in a hive-compatible directory layout. They are most useful for data that's ready for consumption and analysis.
Each domain (for example, sales, customers, or products) should have at least one raw zone and one curated zone.
You can use additional zones to manage data contracts between teams or to provide a more granular breakdown of the teams within a domain, for example, an inventory management zone within the products domain. Data owners can access and manage the data within their domain.
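To make the recommendation about raw and curated zones concrete, the following sketch models a domain as a set of typed zones and reports which of the two recommended zone types are missing. The `Domain` class and the zone IDs are hypothetical and exist only for this illustration. They are not part of any Dataplex API.

```python
# Sketch: check that a domain has at least one raw and one curated zone.
# The Domain class and zone IDs are hypothetical, for illustration only.
from dataclasses import dataclass, field
from typing import Dict, Set

REQUIRED_ZONE_TYPES = {"RAW", "CURATED"}


@dataclass
class Domain:
    name: str
    # Maps a zone ID to its type, "RAW" or "CURATED".
    zones: Dict[str, str] = field(default_factory=dict)


def missing_zone_types(domain: Domain) -> Set[str]:
    """Return the recommended zone types that the domain does not have yet."""
    return REQUIRED_ZONE_TYPES - set(domain.zones.values())


if __name__ == "__main__":
    products = Domain("products", {"inventory-raw": "RAW"})
    print(missing_zone_types(products))
```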
In Dataplex in the Google Cloud console, navigate to the Manage view.
Click the name of the lake (My data mesh) that you'd like to add a zone to.
In the Zones tab, click Add Zone.
In the Display name field, enter My sub domain. Dataplex automatically generates an ID for your zone.
NOTE: The zone name becomes the name of a BigQuery dataset. Therefore, all zones hosted in the same Google Cloud project must have a unique ID, even if they exist within different lakes.
For Type, select Raw zone.
Click Create.
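The same zone could be created with the gcloud CLI. The following sketch builds a `gcloud dataplex zones create` command and rejects a zone ID that is already in use, reflecting the uniqueness requirement in the note above. The IDs are hypothetical, the `existing_zone_ids` check is a local convenience rather than an API feature, and the `SINGLE_REGION` resource location type is an assumption that matches the regional bucket used in this guide.

```python
# Sketch: build the gcloud command that creates a Dataplex zone.
# Lake and zone IDs are hypothetical placeholders.
import shlex
from typing import Iterable, List

ZONE_TYPES = ("RAW", "CURATED")


def create_zone_command(lake_id: str, zone_id: str, display_name: str,
                        zone_type: str = "RAW",
                        location: str = "us-central1",
                        existing_zone_ids: Iterable[str] = ()) -> List[str]:
    """Return the argv list for creating a zone in a lake."""
    if zone_type not in ZONE_TYPES:
        raise ValueError(f"zone_type must be one of {ZONE_TYPES}")
    # Zone IDs must be unique across all lakes in the project.
    if zone_id in set(existing_zone_ids):
        raise ValueError(f"zone ID {zone_id!r} is already used in this project")
    return ["gcloud", "dataplex", "zones", "create", zone_id,
            f"--lake={lake_id}", f"--location={location}",
            f"--type={zone_type}",
            "--resource-location-type=SINGLE_REGION",  # assumption: regional data
            f"--display-name={display_name}"]


if __name__ == "__main__":
    # Print the command for review rather than running it.
    print(shlex.join(create_zone_command("my-data-mesh", "my-sub-domain",
                                         "My sub domain")))
```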
Attach assets to your zones
Attach data assets to your zone. A data asset is a storage resource that contains your data, and can be either a Cloud Storage bucket or a BigQuery dataset. This is the final step in creating your data mesh architecture.
In the Dataplex Manage view, click the lake you created (My data mesh).
In the Zones tab, click the zone (My sub domain) to add the asset to.
In the Assets tab, click Add assets.
Click Add an Asset.
For Type, select Cloud Storage bucket.
In the Display name field, enter Data mesh asset. Dataplex automatically generates an asset ID for you.
In the Bucket field, click Browse.
- Select your bucket from the list.
- Click Select.
Click Done and then click Continue.
Click Continue to accept the default Advanced settings.
Click Submit to add your Cloud Storage bucket as a data asset to your zone.
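For a scripted equivalent, the following sketch builds a `gcloud dataplex assets create` command that attaches a Cloud Storage bucket to a zone. The project, lake, zone, asset, and bucket names are hypothetical placeholders, and the sketch assumes that the bucket is referenced in the form `projects/PROJECT/buckets/BUCKET`.

```python
# Sketch: build the gcloud command that attaches a bucket as an asset.
# All project, lake, zone, asset, and bucket names are placeholders.
import shlex
from typing import List


def create_asset_command(project_id: str, lake_id: str, zone_id: str,
                         asset_id: str, bucket: str, display_name: str,
                         location: str = "us-central1") -> List[str]:
    """Return the argv list for attaching a Cloud Storage bucket to a zone."""
    return ["gcloud", "dataplex", "assets", "create", asset_id,
            f"--project={project_id}", f"--location={location}",
            f"--lake={lake_id}", f"--zone={zone_id}",
            "--resource-type=STORAGE_BUCKET",
            f"--resource-name=projects/{project_id}/buckets/{bucket}",
            f"--display-name={display_name}"]


if __name__ == "__main__":
    # Print the command for review rather than running it.
    print(shlex.join(create_asset_command(
        "my-project", "my-data-mesh", "my-sub-domain",
        "data-mesh-asset", "my-data-mesh-bucket", "Data mesh asset")))
```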
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete your data mesh architecture
In Dataplex in the Google Cloud console, navigate to the Manage view.
For the lake that you want to delete, click View more, and then click Delete.
Confirm the action by entering delete, and click Delete lake.
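If you script the cleanup with the gcloud CLI instead, the order of deletion can matter. The following sketch assumes that assets must be deleted before their zone and zones before their lake, which may differ from how the console flow above handles child resources. All IDs are hypothetical placeholders.

```python
# Sketch: build gcloud delete commands in child-first order.
# Assumes assets go before zones and zones before the lake; IDs are placeholders.
from typing import Dict, List


def teardown_commands(lake_id: str, zones: Dict[str, List[str]],
                      location: str = "us-central1") -> List[List[str]]:
    """Return argv lists that delete assets, then zones, then the lake.

    `zones` maps each zone ID to the list of asset IDs attached to it.
    """
    scope = [f"--location={location}", f"--lake={lake_id}"]
    cmds: List[List[str]] = []
    for zone_id, asset_ids in zones.items():
        for asset_id in asset_ids:
            cmds.append(["gcloud", "dataplex", "assets", "delete", asset_id,
                         *scope, f"--zone={zone_id}"])
        cmds.append(["gcloud", "dataplex", "zones", "delete", zone_id, *scope])
    cmds.append(["gcloud", "dataplex", "lakes", "delete", lake_id,
                 f"--location={location}"])
    return cmds


if __name__ == "__main__":
    for cmd in teardown_commands("my-data-mesh",
                                 {"my-sub-domain": ["data-mesh-asset"]}):
        print(" ".join(cmd))
```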
What's next
- Learn about data processing tasks
- Learn about discovering data
- Learn about using data quality tasks