Cloud Storage transfers

The BigQuery Data Transfer Service for Cloud Storage allows you to schedule recurring data loads from Cloud Storage to BigQuery.

Before you begin

Before you create a Cloud Storage transfer:

Limitations

Recurring transfers from Cloud Storage to BigQuery are subject to the following limitations:

  • All files matching the patterns defined by either a wildcard or by runtime parameters for your transfer must share the same schema you defined for the destination table, or the transfer will fail. Table schema changes between runs also cause the transfer to fail.
  • Because Cloud Storage objects can be versioned, archived Cloud Storage objects are not supported for BigQuery transfers. Objects must be live to be transferred.
  • Unlike individual loads of data from Cloud Storage to BigQuery, for ongoing transfers you must create the destination table and define its schema before you set up the transfer. BigQuery cannot create the table as part of the recurring data transfer process.
  • Transfers from Cloud Storage are always triggered with the WRITE_APPEND preference, which appends data to the destination table. For more information, see the description of the writeDisposition field of the JobConfigurationLoad object.
  • BigQuery Data Transfer Service does not guarantee that all files will be transferred, or transferred only once, if Cloud Storage files are modified while a transfer is in progress.
  • If your dataset's location is set to a value other than US, the regional or multi-regional Cloud Storage bucket must be in the same region as the dataset.
  • BigQuery Data Transfer Service does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.

  • Depending on the format of your Cloud Storage source data, there might be additional limitations; see the limitations documented for your source format.

  • Your Cloud Storage bucket must be in a region or multi-region that is compatible with the region or multi-region of the destination dataset in BigQuery. This is known as colocation. See Cloud Storage transfer Data locations for details.

Minimum intervals

  • Source files are picked up for transfer immediately, with no minimum file age.
  • The minimum interval time between recurring transfers is 15 minutes. The default interval for a recurring transfer is every 24 hours.

Required permissions

When you load data into BigQuery, you need permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you'll also need access to the bucket that contains your data. Ensure that you have the following required permissions:

  • BigQuery: Ensure that the person creating the transfer has the following permissions in BigQuery:

    • bigquery.transfers.update permissions to create the transfer
    • Both bigquery.datasets.get and bigquery.datasets.update permissions on the target dataset

    The bigquery.admin predefined IAM role includes bigquery.transfers.update, bigquery.datasets.update, and bigquery.datasets.get permissions. For more information on IAM roles in BigQuery Data Transfer Service, see Access control reference.

  • Cloud Storage: storage.objects.get permissions are required on the individual bucket or higher. If you are using a URI wildcard, you also need storage.objects.list permissions. To delete the source files after each successful transfer, you also need storage.objects.delete permissions. The storage.objectAdmin predefined IAM role includes all of these permissions.

Setting up a Cloud Storage transfer

To create a Cloud Storage transfer in BigQuery Data Transfer Service:

Console

  1. Go to the BigQuery page in the Cloud Console.

  2. Click Transfers.

  3. Click Create.

  4. On the Create Transfer page:

    • In the Source type section, for Source, choose Cloud Storage.

      Transfer source

    • In the Transfer config name section, for Display name, enter a name for the transfer such as My Transfer. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.

      Transfer name

    • In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.

      • For Repeats, choose an option for how often to run the transfer. The minimum interval is 15 minutes.
        • Daily (default)
        • Weekly
        • Monthly
        • Custom. For Custom Schedule, enter a custom frequency; for example, every day 00:00. See Formatting the schedule.
        • On-demand
      • For Start date and run time, enter the date and time to start the transfer. If you choose Start now, this option is disabled.

        Transfer schedule

    • In the Destination settings section, for Destination dataset, choose the dataset you created to store your data.

      Transfer dataset

    • In the Data source details section:

      • For Destination table, enter the name of your destination table. The destination table must follow the table naming rules. Destination table names also support parameters.
      • For Cloud Storage URI, enter the Cloud Storage URI. Wildcards and parameters are supported.
      • For Write preference, choose:

        • APPEND to append new data to your existing destination table, or
        • MIRROR to refresh the data in the destination table so that it reflects modified data in the source. MIRROR overwrites the destination table with a fresh copy of the source data.
      • For Delete source files after transfer, check the box if you want to delete the source files after each successful transfer. Delete jobs are best effort and do not retry if the first attempt to delete the source files fails.

      • In the Transfer Options section:

        • Under All Formats:
          • For Number of errors allowed, enter the maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an 'invalid' error is returned in the job result, and the job fails. The default value is 0.
          • (Optional) For Decimal target types, enter a comma-separated list of possible SQL data types that the source decimal values could be converted to. Which SQL data type is selected for conversion depends on the following conditions:
            • The data type selected for conversion is the first data type in the following list that supports the precision and scale of the source data, in this order: NUMERIC, BIGNUMERIC, and STRING.
            • If none of the listed data types supports the precision and scale, the data type supporting the widest range in the specified list is selected. If a value exceeds the supported range when the source data is read, an error is thrown.
            • The data type STRING supports all precision and scale values.
            • If this field is left empty, the data type defaults to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.
            • This field cannot contain duplicate data types.
            • The order of the data types that you list in this field is ignored.
        • Under JSON, CSV:
          • For Ignore unknown values, check the box if you want the transfer to drop data that does not fit the destination table's schema.
        • Under CSV:

          • For Field delimiter, enter the character that separates fields. The default value is a comma.
          • For Header rows to skip, enter the number of header rows in the source file(s) if you don't want to import them. The default value is 0.
          • For Allow quoted newlines, check the box if you want to allow newlines within quoted fields.
          • For Allow jagged rows, check the box if you want to allow the transfer of rows with missing NULLABLE columns.

      Cloud Storage source details

    • (Optional) In the Notification options section:

      • Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
      • For Select a Pub/Sub topic, choose your topic name or click Create a topic. This option configures Pub/Sub run notifications for your transfer.
  5. Click Save.
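The Decimal target types selection rules described above can be sketched as a small model. This is illustrative only, not BigQuery's actual implementation; the precision and scale limits are the documented capacities of NUMERIC (38, 9) and BIGNUMERIC (76, 38), simplified to whole digits:

```python
# Illustrative model of the documented "Decimal target types" selection:
# the first type in NUMERIC -> BIGNUMERIC -> STRING order that appears in
# the configured list and supports the source precision and scale wins.

# Approximate documented capacity of each SQL type as (max precision, max
# scale); STRING is unbounded.
CAPACITY = {
    "NUMERIC": (38, 9),
    "BIGNUMERIC": (76, 38),
    "STRING": (float("inf"), float("inf")),
}
ORDER = ["NUMERIC", "BIGNUMERIC", "STRING"]


def select_decimal_target(types, precision, scale):
    """Pick the conversion type for a decimal column.

    `types` is the comma-separated list from the transfer configuration;
    its order is ignored, and duplicate entries are not allowed.
    """
    items = [t.strip() for t in types.split(",")]
    requested = set(items)
    if len(requested) != len(items):
        raise ValueError("duplicate data types are not allowed")
    for candidate in ORDER:  # fixed precedence, regardless of list order
        if candidate not in requested:
            continue
        max_p, max_s = CAPACITY[candidate]
        if precision <= max_p and scale <= max_s:
            return candidate
    # Nothing fits: the widest-ranging requested type is used, and values
    # outside its range raise an error when the source data is read.
    return max(requested, key=ORDER.index)


print(select_decimal_target("NUMERIC,STRING", 20, 5))   # -> NUMERIC
print(select_decimal_target("NUMERIC,STRING", 50, 20))  # -> STRING
```

Note that listing "STRING,NUMERIC" gives the same result as "NUMERIC,STRING": the precedence order is fixed, only membership in the list matters.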

bq

Enter the bq mk command and supply the transfer creation flag, --transfer_config. The following flags are also required:

  • --data_source
  • --display_name
  • --target_dataset
  • --params
bq mk \
--transfer_config \
--project_id=project_id \
--data_source=data_source \
--display_name=name \
--target_dataset=dataset \
--params='parameters'

Where:

  • project_id is your project ID. If --project_id isn't supplied to specify a particular project, the default project is used.
  • data_source is the data source: google_cloud_storage.
  • name is the display name for the transfer configuration. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
  • dataset is the target dataset for the transfer configuration.
  • parameters contains the parameters for the created transfer configuration in JSON format. For example: --params='{"param":"param_value"}'.
    • For Cloud Storage, you must supply the data_path_template, the destination_table_name_template, and the file_format parameters. data_path_template is the Cloud Storage URI that contains the files to be transferred, and it can include one wildcard. destination_table_name_template is the name of your destination table. For file_format, indicate the type of files you want to transfer: CSV, JSON, AVRO, PARQUET, or ORC. The default value is CSV.
    • For all file_format values, you can include the optional param max_bad_records. The default value is 0.
    • For all file_format values, you can include the optional param decimal_target_types. decimal_target_types is a comma-separated list of possible SQL data types that the source decimal values could be converted to. If this field is not provided, the data type defaults to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.
    • For the JSON or CSV values in file_format, you can include the optional param ignore_unknown_values. This param will be ignored if you haven't selected CSV or JSON for the file_format.
    • For CSV file_format, you can include the optional param field_delimiter for the character that separates fields. The default value is a comma. This param will be ignored if you haven't selected CSV for the file_format.
    • For CSV file_format, you can include the optional param skip_leading_rows to indicate the number of header rows you don't want to import. The default value is 0. This param will be ignored if you haven't selected CSV for the file_format.
    • For CSV file_format, you can include the optional param allow_quoted_newlines if you want to allow newlines within quoted fields. This param will be ignored if you haven't selected CSV for the file_format.
    • For CSV file_format, you can include the optional param allow_jagged_rows if you want to accept rows that are missing trailing optional columns. The missing values will be filled in with NULLs. This param will be ignored if you haven't selected CSV for the file_format.
    • The optional param delete_source_files deletes the source files after each successful transfer. (Delete jobs do not retry if the first attempt to delete the source files fails.) The default value for delete_source_files is false.

For example, the following command creates a Cloud Storage transfer named My Transfer using a data_path_template value of gs://mybucket/myfile/*.csv, target dataset mydataset, and file_format CSV. This example includes non-default values for the optional params associated with the CSV file_format.

The transfer is created in the default project:

bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path_template":"gs://mybucket/myfile/*.csv",
"destination_table_name_template":"MyTable",
"file_format":"CSV",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":"|",
"skip_leading_rows":"1",
"allow_quoted_newlines":"true",
"allow_jagged_rows":"false",
"delete_source_files":"true"}' \
--data_source=google_cloud_storage

After running the command, you receive a message like the following:

[URL omitted] Please copy and paste the above URL into your web browser and follow the instructions to retrieve an authentication code.

Follow the instructions and paste the authentication code on the command line.
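Because --params takes a single JSON string, escaping it correctly in the shell can be error-prone. One way to sidestep that, sketched here with the same illustrative bucket and table names as the example above, is to build the parameters in a script and serialize them once:

```python
import json

# Build the transfer parameters as a dict, then serialize once; this avoids
# hand-escaping quotes inside the shell command. The bucket, table, and
# option values mirror the bq example above and are illustrative.
params = {
    "data_path_template": "gs://mybucket/myfile/*.csv",
    "destination_table_name_template": "MyTable",
    "file_format": "CSV",
    "max_bad_records": "1",
    "ignore_unknown_values": "true",
    "field_delimiter": "|",
    "skip_leading_rows": "1",
}

params_json = json.dumps(params)
print(params_json)  # pass this string as the value of --params
```

The printed string can then be substituted into the bq mk command, for example via a shell variable, without any manual quoting.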

API

Use the projects.locations.transferConfigs.create method and supply an instance of the TransferConfig resource.
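For the REST method, the request body is a TransferConfig resource. A minimal body for a Cloud Storage transfer might look like the following sketch; the field names follow the TransferConfig REST reference, and the dataset, bucket, and table values are illustrative:

```python
import json

# Minimal TransferConfig resource body for
# projects.locations.transferConfigs.create (values are illustrative).
transfer_config = {
    "destinationDatasetId": "mydataset",
    "displayName": "My Transfer",
    "dataSourceId": "google_cloud_storage",
    "schedule": "every 24 hours",
    "params": {
        "data_path_template": "gs://mybucket/myfile/*.csv",
        "destination_table_name_template": "MyTable",
        "file_format": "CSV",
    },
}

print(json.dumps(transfer_config, indent=2))
```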

Java

import com.google.api.gax.rpc.ApiException;
import com.google.cloud.bigquery.datatransfer.v1.CreateTransferConfigRequest;
import com.google.cloud.bigquery.datatransfer.v1.DataTransferServiceClient;
import com.google.cloud.bigquery.datatransfer.v1.ProjectName;
import com.google.cloud.bigquery.datatransfer.v1.TransferConfig;
import com.google.protobuf.Struct;
import com.google.protobuf.Value;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sample to create google cloud storage transfer config
public class CreateCloudStorageTransfer {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    final String projectId = "MY_PROJECT_ID";
    String datasetId = "MY_DATASET_ID";
    String tableId = "MY_TABLE_ID";
    // GCS Uri
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv";
    String fileFormat = "CSV";
    String fieldDelimiter = ",";
    String skipLeadingRows = "1";
    Map<String, Value> params = new HashMap<>();
    params.put(
        "destination_table_name_template", Value.newBuilder().setStringValue(tableId).build());
    params.put("data_path_template", Value.newBuilder().setStringValue(sourceUri).build());
    params.put("write_disposition", Value.newBuilder().setStringValue("APPEND").build());
    params.put("file_format", Value.newBuilder().setStringValue(fileFormat).build());
    params.put("field_delimiter", Value.newBuilder().setStringValue(fieldDelimiter).build());
    params.put("skip_leading_rows", Value.newBuilder().setStringValue(skipLeadingRows).build());
    TransferConfig transferConfig =
        TransferConfig.newBuilder()
            .setDestinationDatasetId(datasetId)
            .setDisplayName("Your Google Cloud Storage Config Name")
            .setDataSourceId("google_cloud_storage")
            .setParams(Struct.newBuilder().putAllFields(params).build())
            .setSchedule("every 24 hours")
            .build();
    createCloudStorageTransfer(projectId, transferConfig);
  }

  public static void createCloudStorageTransfer(String projectId, TransferConfig transferConfig)
      throws IOException {
    try (DataTransferServiceClient client = DataTransferServiceClient.create()) {
      ProjectName parent = ProjectName.of(projectId);
      CreateTransferConfigRequest request =
          CreateTransferConfigRequest.newBuilder()
              .setParent(parent.toString())
              .setTransferConfig(transferConfig)
              .build();
      TransferConfig config = client.createTransferConfig(request);
      System.out.println("Cloud storage transfer created successfully: " + config.getName());
    } catch (ApiException ex) {
      System.out.print("Cloud storage transfer was not created. " + ex.toString());
    }
  }
}

Manually triggering a transfer

In addition to automatically scheduled transfers from Cloud Storage, you can manually trigger a transfer to load additional data files.

If the transfer configuration uses runtime parameters, you must specify the range of dates for which to start additional transfers.

To manually trigger a transfer:

Console

  1. Go to the BigQuery page in the Cloud Console.

  2. Click Data transfers.

  3. Click your transfer.

  4. Click RUN TRANSFER NOW or SCHEDULE BACKFILL (for runtime parameterized transfer configurations).

  5. If applicable, choose the Start date and End date, then click OK to confirm.

    RUN TRANSFER NOW

    For runtime parameterized transfer configurations, you will see date options when you click SCHEDULE BACKFILL.

    SCHEDULE BACKFILL
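The backfill date range maps to a pair of RFC 3339 timestamps in the API. The following sketch builds the request body for the startManualRuns method, which backs the SCHEDULE BACKFILL button; the requestedTimeRange field names follow the BigQuery Data Transfer Service REST reference, and the dates are illustrative:

```python
import json
from datetime import datetime, timezone

# Illustrative one-week backfill window; the API expects RFC 3339
# timestamps, so format the datetimes accordingly.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 8, tzinfo=timezone.utc)

body = {
    "requestedTimeRange": {
        "startTime": start.isoformat().replace("+00:00", "Z"),
        "endTime": end.isoformat().replace("+00:00", "Z"),
    }
}

print(json.dumps(body))
```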

What's next