Creating HTTP batch requests for Data Catalog

Each HTTP connection that your application makes requires a certain amount of overhead. The Data Catalog API supports batching, which lets you combine several API calls into a single HTTP request. Use HTTP batching when you have many small requests to make and want to minimize HTTP request overhead. Note that batching reduces connection overhead, but the requests within a batch still count as multiple requests against your API quota.

For general documentation on using HTTP batch requests with Google Cloud, see the Google API Python client documentation.

Creating HTTP batch requests in Python

To use batch requests to create or manipulate entries in Data Catalog, you first need to search for the entries you want to change using catalog.search() or entries.lookup().

Next, follow these steps to build an HTTP batch request using the Google API Python client:

  1. Create a BatchHttpRequest object by calling new_batch_http_request() or with the BatchHttpRequest() constructor. You may pass in a callback, which will be called in response to each request.
  2. Call add() on the BatchHttpRequest object for each request you want to execute. If you passed a callback when creating your BatchHttpRequest object, each add() may include parameters to be passed to the callback.
  3. After you've added the requests, call execute() on the BatchHttpRequest object to execute them. The execute() function blocks until all callbacks have been called.

Requests in a BatchHttpRequest may be executed in parallel, and there are no guarantees for the order of execution. This means requests in the same batch shouldn't be dependent on each other. For example, you shouldn't create an EntryGroup and Entry belonging to it in the same request, as the creation of the Entry may execute before creation of the EntryGroup (causing execution to fail).

Batch requests with regional endpoints

When using HTTP batch requests with Data Catalog regional API endpoints, all API requests in a batch must belong to the same region. When executing the batch, you must call the correct regional endpoint. For example, if your resources are in us-central1, call https://us-central1-datacatalog.googleapis.com/batch.

Region-independent APIs

Region-independent APIs (such as catalog.search() and entries.lookup()) can be grouped with each other, but must not be grouped with region-dependent APIs. For region-independent APIs, use the endpoint: https://datacatalog.googleapis.com/batch.
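One way to keep the endpoint choice in a single place is a small helper that builds the batch URI from an optional region. This helper is hypothetical (not part of the client library), but the URI patterns it produces are the ones described above:

```python
def batch_endpoint(region=None):
    """Return the Data Catalog batch URI for a region.

    Pass None for batches that contain only region-independent calls.
    """
    if region is None:
        return 'https://datacatalog.googleapis.com/batch'
    return 'https://{}-datacatalog.googleapis.com/batch'.format(region)

print(batch_endpoint('us-central1'))
# https://us-central1-datacatalog.googleapis.com/batch
```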

Example

This sample Python application demonstrates how to use an HTTP batch request to create multiple tags from a tag template using the Data Catalog API.

from googleapiclient.discovery import build
from googleapiclient.http import BatchHttpRequest
from oauth2client.service_account import ServiceAccountCredentials
import uuid

#-------------------------------------------------------------#
# 0. Helper and initialization logic
#-------------------------------------------------------------#

# Set the environment configuration.
service_key_file_location = '[SA_PATH]'

project_id = '[MY_PROJECT_ID]'

# Helper container to store results.
class DataContainer:
    def __init__(self):
        self.data = {}

    def callback(self, request_id, response, exception):
        if exception is not None:
            print('request_id: {}, exception: {}'.format(request_id, str(exception)))
        else:
            print(request_id)
            self.data[request_id] = response


# Helper function to build the Discovery Service config.
def get_service(api_name, api_version, scopes, key_file_location):
    """
    Get a service that communicates to a Google API.

    Args:
        api_name: The name of the API to connect to.
        api_version: The API version to connect to.
        scopes: A list of auth scopes to authorize for the application.
        key_file_location: The path to a valid service account JSON key file.

    Returns:
        A service that is connected to the specified API.
    """
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        key_file_location, scopes=scopes)

    # Build the service object.
    service = build(api_name, api_version, credentials=credentials)

    return service

# Helper function to create a UUID for each request
def generated_uui():
    return str(uuid.uuid4())

def create_batch_request(callback):
    # For a list of supported regions, see:
    # https://cloud.google.com/data-catalog/docs/concepts/regions
    region = 'us-datacatalog.googleapis.com'

    return BatchHttpRequest(batch_uri='https://{}/batch'.format(region), callback=callback)

container = DataContainer()

# Scope to set up the Discovery Service config.
scope = 'https://www.googleapis.com/auth/cloud-platform'

# Create service.
service = get_service(
    api_name='datacatalog',
    api_version='v1',
    scopes=[scope],
    key_file_location=service_key_file_location)

# Create the batch request config.
batch = create_batch_request(container.callback)

#-------------------------------------------------------------#
# 1. Start by fetching a list of entries using search call
#-------------------------------------------------------------#

# Create the search request body.
# This example searches for all BigQuery tables in a project.
search_request_body = {
  'query': 'type=TABLE system=BIGQUERY',
  'scope': {'includeProjectIds': [project_id]}
}

# Generate a unique ID for the request.
request_id = generated_uui()

# Add the request to the batch client.
batch.add(service.catalog().search(body=search_request_body), request_id=request_id)

# Execute the batch request.
batch.execute()

# Uncomment to verify the full response from search.
# print(container.data)

response = container.data[request_id]

results = response['results']

first_table = results[0]

# Verify that a first table is present.
print(first_table)

second_table = results[1]

# Verify that a second table is present.
print(second_table)

#-------------------------------------------------------------------#
# 2. Send the batch request to attach tags over the entire result set
#-------------------------------------------------------------------#

# Create a new container
container = DataContainer()

# Create a new batch request
batch = create_batch_request(container.callback)

# Set the template name config
template_name = 'projects/[MY_PROJECT_ID]/locations/[MY-LOCATION]/tagTemplates/[MY-TEMPLATE-NAME]'

for result in results:
    # Generate a unique ID for the request.
    request_id = generated_uui()

    # Add the entry name as the tag parent.
    parent = result['relativeResourceName']

    # Create the tag request body.
    create_tag_request_body = {
        'template': template_name,
        # CHANGE for your template field values.
        'fields': {'etl_score': {'doubleValue': 0.5}}
    }

    # Add requests to the batch client.
    batch.add(service.projects().locations().
              entryGroups().entries().tags().
              create(body=create_tag_request_body,
                     parent=parent),
              request_id=request_id)

# Execute the batch request.
#
# The batch client is region-specific. If you receive HttpError 400
# responses:
# 1. Verify the region you used to create the batch client.
# 2. Verify the region where each Entry is located.
# 3. Verify the region of the parent tag template used by the tags.
batch.execute()

# Uncomment to verify the full response from tag creation.
# print(container.data)