Sharing Datasets across organizations with BigQuery Analytics Hub
Urvashi Chaudhary
Strategic Cloud Engineer, Data
Manan Kshatriya
Strategic Cloud Engineer, Data
Analytics Hub is a data exchange platform within BigQuery that helps organizations streamline the accessibility of data and analytics assets. It provides a unified environment where data analysts, data scientists, and users can collaborate, share data, and gain insights from various data sources.
Here are some of the benefits of using Analytics Hub:
- Data sharing at scale: Analytics Hub makes it easy for data analysts, data scientists, and users to collaborate and share data at scale.
- Improved data security: Analytics Hub, built on top of BigQuery, enables robust security measures to protect data from unauthorized access.
- Access to public datasets: Analytics hub provides access to thousands of valuable datasets, including first-party data from Google
- Built-in features for data providers: Analytics hub provides access to subscriber’s usage metrics and subscription management for data providers.
This blog provides a detailed guide on setting up Analytics Hub for sharing datasets across organizations using Google Cloud, including essential steps and considerations.
Architecture overview
The above diagram shows cross-organizational representation. There are two organizations, Publisher and Subscriber. Project (A) under Publisher organization contains a private data exchange where datasets from all other projects under the same organization are listed.
The architecture above uses a dedicated project for setting up Analytics Hub data exchange. It helps in the following manner:
- Improved management: A dedicated project can make it easier to manage the multiple public and private data exchanges. You can create custom permissions and roles, and you can use billing and cost management tools to track your spending.
- Access to Subscriber usage metrics: A dedicated project will help you easily collect subscriber’s usage metrics by using INFORMATION_SCHEMA view from the project.
- Improved security: With a dedicated project, you can manage authorized subscribers from one place which helps in reducing the risk of unauthorized access or data breaches.
By default, a data exchange is private and only users or groups that have access to that exchange can view or subscribe to the data. After the user subscribes to the listing, a linked dataset is created in the target project accessible to its users as a READ ONLY dataset in BigQuery.
Create data exchanges and listings
First you should enable the Analytics Hub API using the command
gcloud services enable
analyticshub.googleapis.com
Create data exchange
You should create a data exchange in the publisher project. The permission analyticshub.dataExchanges.create
is required to perform the above step. Please follow the link to know how to provide IAM permissions.
Create listing
You should create a listing in the data exchange created above. At the least, following permissions are required for creating a listing:
analyticshub.listings.create
(On data exchange level or higher)bigquery.datasets.get
(on dataset level or higher)bigquery.datasets.update
(on dataset level or higher)
Above permissions are covered under Analytics Hub Publisher/Analytics Hub Admin and BigQuery Data Owner/BigQuery Admin roles.
Add subscriber role
To give subscribers access to a private listing, you must grant the Analytics Hub Subscriber
role to the users for that listing.
In the following sections, you will learn how the subscriber user/group can view the listings available under a data exchange and subscribe to them programmatically.
Steps to view & subscribe to a listing
Below set of IAM roles/permissions are required to subscribe to a listing in your target project:
bigquery.datasets.create
permission at project level inside subscriber organization. This enables the user to create a linked dataset in their own project. This permission is covered as part of the roleroles/bigquery.user
roles/analyticshub.subscriber
role on listing level or at data exchange level in the publisher organization.
Authenticate gcloud (or Cloud Shell) with an existing service account having required permissions using the following command:
If the key file generation is not allowed, the alternative option is to create a VM and assign the above service account. Login to the VM and you can proceed further.
You can also perform the following steps by authenticating as a user.
List available listings
To view all the listings available under the data exchange, user can run the below curl command:
Where project
refers to a projectId or project_number, locationId
refers to a Google Cloud region and DATA_EXCHANGE_ID
refers to Data Exchange ID. A truncated sample response is provided below.
Output
Subscribe to a listing
For subscribing to a listing in the Publisher organization, the Analytics Hub administrator must share a link to that data exchange & project ID with the subscribers. The link has fields data_exchange_id & listing_id
which will be used by the user for the subscribing process. Run the below cURL command to subscribe to the desired listing. If the request is successful the response body will be empty. See API documentation for details.
The request body ‘-d’ specifies the destination project and dataset where the linked dataset should be created in the subscriber’s organization.
Alerting users on new listings
Enable audit logs
You should enable BigQuery Analytics Hub API Audit logs before setting up log based alerts in the next step.
Setup alert for create listing
You can set alerts in cloud monitoring which will create an incident whenever a listing is created in the data exchange. Below is the policy.json file which creates a log based alert whenever a new listing is created in the defined project.
The gcloud command to activate this alert is
Subscription metrics using SCHEMATA_LINK view
The subscription metadata is exposed via INFORMATION_SCHEMA.SCHEMATA_LINKS view. And, the metadata can be aggregated to build useful metrics & to understand usage patterns of subscribers. Below are some sample queries to get started:
1. Number of linked datasets (i.e. subscriptions) created for each listing, sorted by number of subscriptions in descending order.
Where REGION
is any dataset region. For example, region-us
2. Number of unique subscribers for each listing (based on project level or organization level)
The unique number of subscribers can be based on project level or at organization level. To get the number of unique subscribers at organization level, you can use the below query:
To get the number of unique subscribers based on project level, replace column LINKED_SCHEMA_ORG_DISPLAY_NAME
with LINKED_SCHEMA_CATALOG_NAME
in the above query.
This should give you a fair idea of how to share data across organizations using Bigquery Analytics Hub and analyze the usage metrics of subscribers. To learn more about Analytics Hub visit the Analytics Hub page and get started today.