Using Cloud Storage filesets

You can use Data Catalog APIs to create and manage Cloud Storage fileset entries (called "filesets" in the remainder of this document).

How Filesets Fit Within Data Catalog

Before working with filesets, review the key Data Catalog concepts and how filesets fit within Data Catalog.

Entry

A Data Catalog entry represents a Google Cloud resource, such as a BigQuery dataset or table, Pub/Sub topic, or a Cloud Storage fileset. You create, search, and manage Data Catalog entries to explore and manage Google Cloud data resources. A fileset entry can include an optional schema of fileset data.

Entry Group

Entries are contained in an entry group. An entry group contains logically related entries together with Identity and Access Management policies that specify the users who can create, edit, and view entries within an entry group.

Data Catalog automatically creates an entry group for BigQuery entries ("@bigquery") and Pub/Sub topics ("@pubsub"). You create your own entry group to contain your Cloud Storage fileset entries and the IAM policies associated with those entries. Entry groups, like entries, can be searched.

The following illustration depicts how predefined and user-created entry groups fit in the Data Catalog data model.

Cloud Storage Fileset

A Cloud Storage fileset is an entry within a user-created entry group. It is defined by one or more file patterns that specify a set of one or more Cloud Storage files.
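
As a rough illustration of how one or more file patterns pick out a set of files, the sketch below filters a list of object URLs with Python's fnmatch module. It is only an approximation: fnmatch lets * match across "/" separators, which may differ from the wildcard rules Data Catalog applies, and select_files is a hypothetical helper rather than part of any Google library.

```python
from fnmatch import fnmatchcase


def select_files(patterns, object_urls):
    """Return the URLs matched by at least one pattern, in input order.

    Assumption: fnmatch-style wildcards stand in for Data Catalog's
    own file pattern rules, which are not reproduced here.
    """
    return [
        url for url in object_urls
        if any(fnmatchcase(url, pattern) for pattern in patterns)
    ]


# Hypothetical bucket contents, used only for illustration.
urls = [
    "gs://my-bucket/a.csv",
    "gs://my-bucket/b.csv",
    "gs://my-bucket/notes.txt",
]

print(select_files(["gs://my-bucket/*.csv"], urls))
```

With the sample list above, only the two .csv objects are selected.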

File pattern requirements:

Creating entry groups and filesets

Filesets must be placed within a user-created entry group. If you have not created an entry group, first create the entry group, then create the fileset within the entry group. You can set IAM policies on the entry group to define who has access to filesets and other entries within the entry group.

gcloud

1. Create an entry group

Use the gcloud data-catalog entry-groups create command to create an entry group.

Example:

gcloud data-catalog entry-groups create my_entrygroup \
    --location=us-central1

2. Create a fileset within the entry group

Use the gcloud data-catalog entries create command to create a fileset within an entry group. The following example creates a fileset entry that includes a schema of the fileset data.

gcloud data-catalog entries create my_fileset_entry \
    --location=us-central1 \
    --entry-group=my_entrygroup \
    --type=FILESET \
    --gcs-file-patterns=gs://my-bucket/*.csv \
    --schema-from-file=path_to_schema_file \
    --description="Fileset description ..."

Flag notes:

  • --gcs-file-patterns: See File pattern requirements.
  • --schema-from-file: The following sample shows the JSON format of the schema text file accepted by the --schema-from-file flag.
    [
      {
        "column": "first_name",
        "description": "First name",
        "mode": "REQUIRED",
        "type": "STRING"
      },
      {
        "column": "last_name",
        "description": "Last name",
        "mode": "REQUIRED",
        "type": "STRING"
      },
      {
        "column": "address",
        "description": "Address",
        "mode": "REPEATED",
        "type": "STRING"
      }
    ]
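
If the schema is built programmatically, a file in the format above could be produced with a few lines of Python. This is a minimal sketch: column, render_schema, and write_schema_file are hypothetical helper names, and the only format assumed is the JSON list of column objects shown in the sample.

```python
import json


def column(name, description, mode, col_type):
    """Build one column object using the keys from the sample schema file."""
    return {
        "column": name,
        "description": description,
        "mode": mode,
        "type": col_type,
    }


def render_schema(columns):
    """Serialize the columns as the JSON list expected in the schema file."""
    return json.dumps(columns, indent=2)


def write_schema_file(path, columns):
    """Write the schema text to a file that can be passed to --schema-from-file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_schema(columns))


schema_columns = [
    column("first_name", "First name", "REQUIRED", "STRING"),
    column("last_name", "Last name", "REQUIRED", "STRING"),
    column("address", "Address", "REPEATED", "STRING"),
]

print(render_schema(schema_columns))
```

Calling write_schema_file with a path of your choice saves the same text to disk.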
    

Console

  1. Select Create entry group from the CREATE drop-down list, or from the Entry Groups section of the Data Catalog UI in the Google Cloud Console.

  2. Complete the Create Entry Group form, then click CREATE.

  3. The Entry group details page opens. With the ENTRIES tab selected, click CREATE.

  4. Complete the Create Fileset form.

    1. To attach a schema, click Define Schema to open the Schema form. Click + ADD FIELDS to add fields individually or toggle Edit as text in the upper right of the form to specify the fields in JSON format.
    2. Click Save to save the schema.
  5. Click Create to create the fileset.

Python

  1. Install the client library
  2. Set up application default credentials
  3. Run the code
    """
    This application demonstrates how to perform core operations with the
    Data Catalog API.
    
    For more information, see the README.md and the official documentation at
    https://cloud.google.com/data-catalog/docs.
    """
    
    # -------------------------------
    # Import required modules.
    # -------------------------------
    from google.api_core.exceptions import NotFound, PermissionDenied
    from google.cloud import datacatalog_v1
    
    
    # -------------------------------
    # Set your Google Cloud Platform project ID.
    # -------------------------------
    project_id = 'your-project-id'
    
    # -------------------------------
    # Currently, Data Catalog stores metadata in the
    # us-central1 region.
    # -------------------------------
    location = 'us-central1'
    
    # -------------------------------
    # Use Application Default Credentials to create a new
    # Data Catalog client. GOOGLE_APPLICATION_CREDENTIALS
    # environment variable must be set with the location
    # of a service account key file.
    # -------------------------------
    datacatalog = datacatalog_v1.DataCatalogClient()
    
    # -------------------------------
    # 1. Environment cleanup: delete pre-existing data.
    # -------------------------------
    # Delete any pre-existing Entry with the same name
    # that will be used in step 3.
    expected_entry_name = datacatalog_v1.DataCatalogClient\
        .entry_path(project_id, location, 'fileset_entry_group', 'fileset_entry_id')
    
    try:
        datacatalog.delete_entry(name=expected_entry_name)
    except (NotFound, PermissionDenied):
        pass
    
    # Delete any pre-existing Entry Group with the same name
    # that will be used in step 2.
    expected_entry_group_name = datacatalog_v1.DataCatalogClient\
        .entry_group_path(project_id, location, 'fileset_entry_group')
    
    try:
        datacatalog.delete_entry_group(name=expected_entry_group_name)
    except (NotFound, PermissionDenied):
        pass
    
    # -------------------------------
    # 2. Create an Entry Group.
    # -------------------------------
    entry_group_obj = datacatalog_v1.types.EntryGroup()
    entry_group_obj.display_name = 'My Fileset Entry Group'
    entry_group_obj.description = 'This Entry Group consists of ....'
    
    entry_group = datacatalog.create_entry_group(
        parent=datacatalog_v1.DataCatalogClient.location_path(project_id, location),
        entry_group_id='fileset_entry_group',
        entry_group=entry_group_obj)
    print('Created entry group: {}'.format(entry_group.name))
    
    
    # -------------------------------
    # 3. Create a Fileset Entry.
    # -------------------------------
    entry = datacatalog_v1.types.Entry()
    entry.display_name = 'My Fileset'
    entry.description = 'This fileset consists of ....'
    entry.gcs_fileset_spec.file_patterns.append('gs://my_bucket/*')
    entry.type = datacatalog_v1.enums.EntryType.FILESET
    
    # Create the Schema, for example when you have a csv file.
    columns = []
    columns.append(datacatalog_v1.types.ColumnSchema(
        column='first_name',
        description='First name',
        mode='REQUIRED',
        type='STRING'))
    
    columns.append(datacatalog_v1.types.ColumnSchema(
        column='last_name',
        description='Last name',
        mode='REQUIRED',
        type='STRING'))
    
    # Create sub columns for the addresses parent column
    subcolumns = []
    
    subcolumns.append(datacatalog_v1.types.ColumnSchema(
        column='city',
        description='City',
        mode='NULLABLE',
        type='STRING'))
    
    subcolumns.append(datacatalog_v1.types.ColumnSchema(
        column='state',
        description='State',
        mode='NULLABLE',
        type='STRING'))
    
    columns.append(datacatalog_v1.types.ColumnSchema(
        column='addresses',
        description='Addresses',
        mode='REPEATED',
        subcolumns=subcolumns,
        type='RECORD'))
    
    entry.schema.columns.extend(columns)
    
    entry = datacatalog.create_entry(
        parent=entry_group.name,
        entry_id='fileset_entry_id',
        entry=entry)
    print('Created entry: {}'.format(entry.name))
      

Java

  1. Install the client library
  2. Set up application default credentials
  3. Run the code
    /*
    This application demonstrates how to perform core operations with the
    Data Catalog API.
    
    For more information, see the README.md and the official documentation at
    https://cloud.google.com/data-catalog/docs.
    */
    
    package com.example.datacatalog;
    
    import com.google.cloud.datacatalog.v1.ColumnSchema;
    import com.google.cloud.datacatalog.v1.CreateEntryGroupRequest;
    import com.google.cloud.datacatalog.v1.CreateEntryRequest;
    import com.google.cloud.datacatalog.v1.Entry;
    import com.google.cloud.datacatalog.v1.EntryGroup;
    import com.google.cloud.datacatalog.v1.EntryGroupName;
    import com.google.cloud.datacatalog.v1.EntryName;
    import com.google.cloud.datacatalog.v1.EntryType;
    import com.google.cloud.datacatalog.v1.GcsFilesetSpec;
    import com.google.cloud.datacatalog.v1.LocationName;
    import com.google.cloud.datacatalog.v1.Schema;
    
    import com.google.cloud.datacatalog.v1.DataCatalogClient;
    
    public class CreateFilesetEntry {
    
      public static void createEntry() {
        // TODO(developer): Replace these variables before running the sample.
        String projectId = "my-project-id";
        String entryGroupId = "fileset_entry_group";
        String entryId = "fileset_entry_id";
        createEntry(projectId, entryGroupId, entryId);
      }
    
      /**
       * Create Fileset Entry
       */
      public static void createEntry(String projectId, String entryGroupId, String entryId) {
    
        // -------------------------------
        // Currently, Data Catalog stores metadata in the
        // us-central1 region.
        // -------------------------------
        String location = "us-central1";
    
        // Initialize client that will be used to send requests. This client only needs to be created
        // once, and can be reused for multiple requests. After completing all of your requests, call
        // the "close" method on the client to safely clean up any remaining background resources.
        try (DataCatalogClient dataCatalogClient = DataCatalogClient.create()) {
    
          // -------------------------------
          // 1. Environment cleanup: delete pre-existing data.
          // -------------------------------
          // Delete any pre-existing Entry with the same name
          // that will be used in step 3.
          try {
            dataCatalogClient.deleteEntry(
                EntryName.of(projectId, location, entryGroupId, entryId).toString());
          } catch (Exception e) {
            System.out.println("Entry does not exist.");
          }
    
          // Delete any pre-existing Entry Group with the same name
          // that will be used in step 2.
          try {
            dataCatalogClient.deleteEntryGroup(
                EntryGroupName.of(projectId, location, entryGroupId).toString());
          } catch (Exception e) {
            System.out.println("Entry Group does not exist.");
          }
    
          // -------------------------------
          // 2. Create an Entry Group.
          // -------------------------------
          // Construct the EntryGroup for the EntryGroup request.
          EntryGroup entryGroup =
              EntryGroup.newBuilder()
                  .setDisplayName("My Fileset Entry Group")
                  .setDescription("This Entry Group consists of ....")
                  .build();
    
          // Construct the EntryGroup request to be sent by the client.
          CreateEntryGroupRequest entryGroupRequest = CreateEntryGroupRequest.newBuilder()
              .setParent(LocationName.of(projectId, location).toString())
              .setEntryGroupId(entryGroupId)
              .setEntryGroup(entryGroup)
              .build();
    
          // Use the client to send the API request.
          EntryGroup entryGroupResponse = dataCatalogClient.createEntryGroup(entryGroupRequest);
    
          System.out.printf("\nEntry Group created with name: %s\n", entryGroupResponse.getName());
    
          // -------------------------------
          // 3. Create a Fileset Entry.
          // -------------------------------
          // Construct the Entry for the Entry request.
          Entry entry =
              Entry.newBuilder()
                  .setDisplayName("My Fileset")
                  .setDescription("This fileset consists of ....")
                  .setGcsFilesetSpec(
                      GcsFilesetSpec.newBuilder().addFilePatterns("gs://my_bucket/*").build())
                  .setSchema(
                      Schema.newBuilder()
                          .addColumns(
                              ColumnSchema.newBuilder()
                                  .setColumn("first_name")
                                  .setDescription("First name")
                                  .setMode("REQUIRED")
                                  .setType("STRING")
                                  .build())
                          .addColumns(
                              ColumnSchema.newBuilder()
                                  .setColumn("last_name")
                                  .setDescription("Last name")
                                  .setMode("REQUIRED")
                                  .setType("STRING")
                                  .build())
                          .addColumns(
                              ColumnSchema.newBuilder()
                                  .setColumn("addresses")
                                  .setDescription("Addresses")
                                  .setMode("REPEATED")
                                  .setType("RECORD")
                                  .addSubcolumns(
                                      ColumnSchema.newBuilder()
                                          .setColumn("city")
                                          .setDescription("City")
                                          .setMode("NULLABLE")
                                          .setType("STRING")
                                          .build())
                                  .addSubcolumns(
                                      ColumnSchema.newBuilder()
                                          .setColumn("state")
                                          .setDescription("State")
                                          .setMode("NULLABLE")
                                          .setType("STRING")
                                          .build())
                                  .build())
                          .build())
                  .setType(EntryType.FILESET)
                  .build();
    
          // Construct the Entry request to be sent by the client.
          CreateEntryRequest entryRequest = CreateEntryRequest.newBuilder()
              .setParent(entryGroupResponse.getName())
              .setEntryId(entryId)
              .setEntry(entry)
              .build();
    
          // Use the client to send the API request.
          Entry entryResponse = dataCatalogClient.createEntry(entryRequest);
    
          System.out.printf("\nEntry created with name: %s\n", entryResponse.getName());
    
    
        } catch (Exception e) {
          System.out.println("Error in create entry process:\n" + e.toString());
        }
      }
    }
    

Node.js

  1. Install the client library
  2. Set up application default credentials
  3. Run the code
    /**
     * This application demonstrates how to create an Entry Group and a fileset
     * Entry with the Cloud Data Catalog API.
    
     * For more information, see the README.md under /datacatalog and the
     * documentation at https://cloud.google.com/data-catalog/docs.
     */
    const main = async (
      projectId = process.env.GCLOUD_PROJECT,
      entryGroupId,
      entryId
    ) => {
      // -------------------------------
      // Import required modules.
      // -------------------------------
      const { DataCatalogClient } = require('@google-cloud/datacatalog').v1;
      const datacatalog = new DataCatalogClient();
    
      // -------------------------------
      // Currently, Data Catalog stores metadata in the
      // us-central1 region.
      // -------------------------------
      const location = "us-central1";
    
      // -------------------------------
      // 1. Environment cleanup: delete pre-existing data.
      // -------------------------------
      // Delete any pre-existing Entry with the same name
      // that will be used in step 3.
      try {
        const formattedName = datacatalog.entryPath(projectId, location, entryGroupId, entryId);
        await datacatalog.deleteEntry({ name: formattedName });
      } catch (err) {
        console.log('Entry does not exist.');
      }
    
      // Delete any pre-existing Entry Group with the same name
      // that will be used in step 2.
      try {
        const formattedName = datacatalog.entryGroupPath(projectId, location, entryGroupId);
        await datacatalog.deleteEntryGroup({ name: formattedName });
      } catch (err) {
        console.log('Entry Group does not exist.');
      }
    
      // -------------------------------
      // 2. Create an Entry Group.
      // -------------------------------
      // Construct the EntryGroup for the EntryGroup request.
      const entryGroup = {
        displayName: 'My Fileset Entry Group',
        description: 'This Entry Group consists of ....'
      };
    
      // Construct the EntryGroup request to be sent by the client.
      const entryGroupRequest = {
        parent: datacatalog.locationPath(projectId, location),
        entryGroupId: entryGroupId,
        entryGroup: entryGroup,
      };
    
      // Use the client to send the API request.
      await datacatalog.createEntryGroup(entryGroupRequest);
    
      // -------------------------------
      // 3. Create a Fileset Entry.
      // -------------------------------
      // Construct the Entry for the Entry request.
      // Numeric value of the FILESET entry type.
      const FILESET_TYPE = 4;
    
      const entry = {
        displayName: 'My Fileset',
        description: 'This fileset consists of ....',
        gcsFilesetSpec: {filePatterns: ['gs://my_bucket/*']},
        schema: {
          columns: [
            {
              column: 'first_name',
              description: 'First name',
              mode: 'REQUIRED',
              type: 'STRING',
            },
            {
              column: 'last_name',
              description: 'Last name',
              mode: 'REQUIRED',
              type: 'STRING',
            },
            {
              column: 'addresses',
              description: 'Addresses',
              mode: 'REPEATED',
              subcolumns: [
                {
                  column: 'city',
                  description: 'City',
                  mode: 'NULLABLE',
                  type: 'STRING',
                },
                {
                  column: 'state',
                  description: 'State',
                  mode: 'NULLABLE',
                  type: 'STRING',
                },
              ],
              type: 'RECORD',
            },
          ],
        },
        type: FILESET_TYPE,
      };
    
      // Construct the Entry request to be sent by the client.
      const request = {
        parent: datacatalog.entryGroupPath(projectId, location, entryGroupId),
        entryId: entryId,
        entry: entry,
      };
    
      // Use the client to send the API request.
      const [response] = await datacatalog.createEntry(request);
    
      console.log(response);
    };
    
    // Usage: node createFilesetEntry.js <projectId> <entryGroupId> <entryId>
    main(...process.argv.slice(2));
    

REST & CMD LINE

If you do not have access to Cloud Client libraries for your language or want to test the API using REST requests, see the following examples and refer to the Data Catalog REST API entryGroups.create and entryGroups.entries.create documentation.

1. Create an entry group

Before using any of the request data below, make the following replacements:

  • project-id: Your GCP project ID
  • entryGroupId: The ID must begin with a letter or underscore, contain only English letters, numbers and underscores, and be at most 64 characters.
  • displayName: The textual name for the entry group.
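
The ID rule above (which is also stated for entryId later in this section) can be checked locally before sending a request. The following is a minimal sketch; is_valid_id is a hypothetical helper, and it encodes only the three constraints listed here.

```python
import re

# Begins with a letter or underscore, then only English letters,
# digits, and underscores, for a total length of at most 64.
_ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")


def is_valid_id(candidate):
    """Return True if the candidate satisfies the documented ID rule."""
    return _ID_PATTERN.fullmatch(candidate) is not None


for candidate in ("fileset_entry_group", "_private", "1st_group", "my-group"):
    print(candidate, is_valid_id(candidate))
```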

HTTP method and URL:

POST https://datacatalog.googleapis.com/v1/projects/project-id/locations/us-central1/entryGroups?entryGroupId=entryGroupId

Request JSON body:

{
  "displayName": "Entry Group display name"
}

You should receive a JSON response similar to the following:

{
  "name": "projects/my_projectid/locations/us-central1/entryGroups/my_entry_group",
  "displayName": "Entry Group display name",
  "dataCatalogTimestamps": {
    "createTime": "2019-10-19T16:35:50.135Z",
    "updateTime": "2019-10-19T16:35:50.135Z"
  }
}
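
For readers scripting this call without a client library, the request above could be assembled with Python's standard library as sketched below. The function name is hypothetical, and the use of an OAuth 2.0 bearer token in the Authorization header is an assumption about how you authenticate; the sketch builds the request object but does not send it.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://datacatalog.googleapis.com/v1"


def build_create_entry_group_request(
    project_id, entry_group_id, display_name, access_token,
    location="us-central1",
):
    """Build (but do not send) the POST request that creates an entry group."""
    query = urllib.parse.urlencode({"entryGroupId": entry_group_id})
    url = (
        f"{API_ROOT}/projects/{project_id}/locations/{location}"
        f"/entryGroups?{query}"
    )
    body = json.dumps({"displayName": display_name}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Assumption: access_token is a valid OAuth 2.0 access token.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )


request = build_create_entry_group_request(
    "my-project", "my_entry_group", "Entry Group display name", "ACCESS_TOKEN"
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out here so the sketch runs without credentials.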

2. Create a fileset within the entry group

Before using any of the request data below, make the following replacements:

  • project_id: Your GCP project ID
  • entryGroupId: ID of the existing entry group. The fileset will be created in this entry group.
  • entryId: ID of the new fileset. The ID must begin with a letter or underscore, contain only English letters, numbers and underscores, and be at most 64 characters.
  • description: Fileset description.
  • displayName: The textual name for the fileset entry.
  • filePatterns: Must start with "gs://bucket_name/". See File pattern requirements.
  • schema: Fileset schema.

    Example JSON schema:
    { ...
      "schema": {
        "columns": [
          {
            "column": "first_name",
            "description": "First name",
            "mode": "REQUIRED",
            "type": "STRING"
          },
          {
            "column": "last_name",
            "description": "Last name",
            "mode": "REQUIRED",
            "type": "STRING"
          },
          {
            "column": "address",
            "description": "Address",
            "mode": "REPEATED",
            "subcolumns": [
              {
                "column": "city",
                "description": "City",
                "mode": "NULLABLE",
                "type": "STRING"
              },
              {
                "column": "state",
                "description": "State",
                "mode": "NULLABLE",
                "type": "STRING"
              }
            ],
            "type": "RECORD"
          }
        ]
      }
    ...
    }
    

HTTP method and URL:

POST https://datacatalog.googleapis.com/v1/projects/project_id/locations/us-central1/entryGroups/entryGroupId/entries?entryId=entryId

Request JSON body:

{
  "description": "Fileset description.",
  "displayName": "Display name",
  "gcsFilesetSpec": {
    "filePatterns": [
      "gs://bucket_name/file_pattern"
    ]
  },
  "type": "FILESET",
  "schema": { schema }
}
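
The request body above could also be assembled in code, which makes it easier to plug in a schema. The sketch below is illustrative only: fileset_entry_body is a hypothetical helper, and the single check it performs is the documented requirement that file patterns start with gs://.

```python
import json


def fileset_entry_body(display_name, description, file_patterns, columns=None):
    """Return the JSON-serializable body for a fileset entry create request."""
    for pattern in file_patterns:
        if not pattern.startswith("gs://"):
            raise ValueError(f"File pattern must start with gs://: {pattern}")
    body = {
        "description": description,
        "displayName": display_name,
        "gcsFilesetSpec": {"filePatterns": list(file_patterns)},
        "type": "FILESET",
    }
    if columns:
        # The schema is optional; include it only when columns are given.
        body["schema"] = {"columns": list(columns)}
    return body


example_body = fileset_entry_body(
    "Display name",
    "Fileset description.",
    ["gs://bucket_name/file_pattern"],
    columns=[
        {
            "column": "first_name",
            "description": "First name",
            "mode": "REQUIRED",
            "type": "STRING",
        }
    ],
)
print(json.dumps(example_body, indent=2))
```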

You should receive a JSON response similar to the following:

{
  "name": "projects/my_project_id/locations/us-central1/entryGroups/my_entryGroup_id/entries/my_entry_id",
  "type": "FILESET",
  "displayName": "My Fileset",
  "description": "My Fileset description.",
  "schema": {
    "columns": [
      {
        "type": "STRING",
        "description": "First name",
        "mode": "REQUIRED",
        "column": "first_name"
      },
      {
        "type": "STRING",
        "description": "Last name",
        "mode": "REQUIRED",
        "column": "last_name"
      },
      {
        "type": "RECORD",
        "description": "Address",
        "mode": "REPEATED",
        "column": "address",
        "subcolumns": [
          {
            "type": "STRING",
            "description": "City",
            "mode": "NULLABLE",
            "column": "city"
          },
          {
            "type": "STRING",
            "description": "State",
            "mode": "NULLABLE",
            "column": "state"
          }
        ]
      }
    ]
  },
  "gcsFilesetSpec": {
    "filePatterns": [
      "gs://my_bucket_name/chicago_taxi_trips/csv/shard-*.csv"
    ]
  },
  "sourceSystemTimestamps": {
    "createTime": "2019-10-23T23:11:26.326Z",
    "updateTime": "2019-10-23T23:11:26.326Z"
  },
  "linkedResource": "//datacatalog.googleapis.com/projects/my_project_id/locations/us-central1/entryGroups/my_entryGroup_id/entries/my_entry_id"
}

IAM Roles, Permissions, and Policies

Data Catalog defines entry and entryGroup roles to facilitate permission management of filesets and other Data Catalog resources.

Entry roles Description
dataCatalog.entryOwner Owner of a particular entry or group of entries.
  • Permissions:
    • datacatalog.entries.(*)
    • datacatalog.entryGroups.get
  • Applicability:
    • Organization, project, and entryGroup.
dataCatalog.entryViewer Can view details of entry & entryGroup.
  • Permissions:
    • datacatalog.entries.get
    • datacatalog.entryGroups.get
  • Applicability:
    • Organization, project, and entryGroup.
entryGroup roles Description
dataCatalog.entryGroupOwner Owner of a particular entryGroup.
  • Permissions:
    • datacatalog.entryGroups.(*)
    • datacatalog.entries.(*)
  • Applicability:
    • Organization, project, and entryGroup level.
dataCatalog.entryGroupCreator Can create entryGroups within a project. The creator of an entryGroup is automatically granted the dataCatalog.entryGroupOwner role.
  • Permissions:
    • datacatalog.entryGroups.(get | create)
  • Applicability:
    • Organization and project level.
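
To make the role listing easier to reason about, the sketch below encodes the permissions from the table as wildcard patterns and checks whether a role covers a given permission. It is a local illustration only: it does not call IAM, role_allows is a hypothetical helper, and the role names are copied as written in this section.

```python
from fnmatch import fnmatchcase

# Permissions per role, transcribed from the table above.
# "(*)" is written as the wildcard "*".
ROLE_PERMISSIONS = {
    "dataCatalog.entryOwner": [
        "datacatalog.entries.*",
        "datacatalog.entryGroups.get",
    ],
    "dataCatalog.entryViewer": [
        "datacatalog.entries.get",
        "datacatalog.entryGroups.get",
    ],
    "dataCatalog.entryGroupOwner": [
        "datacatalog.entryGroups.*",
        "datacatalog.entries.*",
    ],
    "dataCatalog.entryGroupCreator": [
        "datacatalog.entryGroups.get",
        "datacatalog.entryGroups.create",
    ],
}


def role_allows(role, permission):
    """Return True if the role's listed permissions cover the permission."""
    patterns = ROLE_PERMISSIONS.get(role, [])
    return any(fnmatchcase(permission, pattern) for pattern in patterns)


print(role_allows("dataCatalog.entryViewer", "datacatalog.entries.get"))
print(role_allows("dataCatalog.entryViewer", "datacatalog.entries.create"))
```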

Setting IAM policies

Users with datacatalog.<resource>.setIamPolicy permission can set IAM policies on Data Catalog entry groups and other Data Catalog resources (see Data Catalog roles).

gcloud

Console

Navigate to the Entry group details page in the Data Catalog UI then use the IAM panel located on the right side to grant or revoke permissions.

Granting Entry Group roles

Example 1:

A company with different business contexts for its filesets creates separate order-files and user-files entry groups.

The company grants users the EntryGroup Viewer role for order-files, meaning they can only search for entries contained in that entry group. Their search results do not return entries in the user-files entry group.

Example 2:

A company grants the EntryGroup Viewer role to a user only in the project_entry_group project. The user will only be able to view entries within that project.

Searching filesets

Users can restrict the scope of search in Data Catalog by using the type facet. type=entry_group restricts the search query to entry groups, while type=fileset searches only for filesets. The type facet can be combined with other facets, such as projectid.
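
Because a query is a space-separated list of facet=value terms, it can be composed programmatically. The following is a minimal sketch that assumes only the two facets named above; build_search_query is a hypothetical helper.

```python
def build_search_query(**facets):
    """Join facet=value terms with spaces, in the order they are given."""
    return " ".join(f"{name}={value}" for name, value in facets.items())


# Entry groups in one project.
print(build_search_query(projectid="my-project", type="entry_group"))

# Filesets only.
print(build_search_query(type="fileset"))
```

The resulting strings can be passed as the query to the search commands that follow.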

gcloud

  • Search for entry groups in a project:

    gcloud data-catalog search \
        --include-project-ids=my-project \
        "projectid=my-project type=entry_group"
    

  • Search for all entry groups you can access:

    gcloud data-catalog search \
        --include-project-ids=my-project \
        "type=entry_group"
    

  • Search for filesets in a project:

    gcloud data-catalog search \
        --include-project-ids=my-project \
        "type=entry.fileset"
    

  • Search for filesets in a project - simplified syntax:

    gcloud data-catalog search \
        --include-project-ids=my-project \
        "type=fileset"