Introduction to Loading Data from Google Cloud Storage

This page provides an overview of loading data from Cloud Storage into BigQuery.

Overview

When you load data from Google Cloud Storage into BigQuery, your data can be in any of the following formats:

  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • Avro
  • Parquet
  • ORC (Beta)
  • Google Cloud Datastore backups

BigQuery supports loading data from any of the following Cloud Storage storage classes:

  • Multi-Regional
  • Regional
  • Nearline
  • Coldline

Location considerations

When you choose a location for your data, consider the following:
  • Colocate your BigQuery dataset and your external data source.
    • When you query data in an external data source such as Cloud Storage, the data you're querying must be in the same location as your BigQuery dataset. For example, if your BigQuery dataset is in the EU multi-regional location, the Cloud Storage bucket containing the data you're querying must be in a multi-regional bucket in the EU. If your dataset is in the US multi-regional location, your Cloud Storage bucket must be in a multi-regional bucket in the US.
    • If your dataset is in a regional location, the Cloud Storage bucket containing the data you're querying must be in a regional bucket in the same location. For example, if your dataset is in the Tokyo region, your Cloud Storage bucket must be a regional bucket in Tokyo.
    • If your external dataset is in Cloud Bigtable, your datatset must be in the US or the EU multi-regional location. Your Cloud Bigtable data must be in one of the supported Cloud Bigtable locations.
    • Location considerations do not apply to Google Drive external data sources.
  • Colocate your Cloud Storage buckets for loading data.
    • If your BigQuery dataset is in a multi-regional location, the Cloud Storage bucket containing the data you're loading must be in a regional or multi-regional bucket in the same location. For example, if your BigQuery dataset is in the EU, the Cloud Storage bucket must be in a regional or multi-regional bucket in the EU.
    • If your dataset is in a regional location, your Cloud Storage bucket must be a regional bucket in the same location. For example, if your dataset is in the Tokyo region, your Cloud Storage bucket must be a regional bucket in Tokyo.
    • Exception: If your dataset is in the US multi-regional location, you can load data from a Cloud Storage bucket in any regional or multi-regional location.
  • Colocate your Cloud Storage buckets for exporting data.
    • When you export data, the regional or multi-regional Cloud Storage bucket must be in the same location as the BigQuery dataset. For example, if your BigQuery dataset is in the EU multi-regional location, the Cloud Storage bucket containing the data you're exporting must be in a regional or multi-regional location in the EU.
    • If your dataset is in a regional location, your Cloud Storage bucket must be a regional bucket in the same location. For example, if your dataset is in the Tokyo region, your Cloud Storage bucket must be a regional bucket in Tokyo.
    • Exception: If your dataset is in the US multi-regional location, you can export data into a Cloud Storage bucket in any regional or multi-regional location.
  • Develop a data management plan.
For more information on Cloud Storage locations, see Bucket Locations in the Cloud Storage documentation.

Moving BigQuery data between locations

You cannot change the location of a dataset after it is created. Also, you cannot move a dataset from one location to another. If you need to move a dataset from one location to another, follow this process:

  1. Export the data from your BigQuery tables to a regional or multi-region Cloud Storage bucket in the same location as your dataset. For example, if your dataset is in the EU multi-region location, export your data into a regional or multi-region bucket in the EU.

    There are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.

  2. Copy or move the data from your Cloud Storage bucket to a regional or multi-region bucket in the new location. For example, if you are moving your data from the US multi-region location to the Tokyo regional location, you would transfer the data to a regional bucket in Tokyo. For information on transferring Cloud Storage objects, see Renaming, Copying, and Moving Objects in the Cloud Storage documentation.

    Note that transferring data between regions incurs network egress charges in Cloud Storage.

  3. After you transfer the data to a Cloud Storage bucket in the new location, create a new BigQuery dataset (in the new location). Then, load your data from the Cloud Storage bucket into BigQuery.

    You are not charged for loading the data into BigQuery, but you will incur charges for storing the data in Cloud Storage until you delete the data or the bucket. You are also charged for storing the data in BigQuery after it is loaded. Loading data into BigQuery is subject to the limits on load jobs.

For more information on using Cloud Storage to store and move large datasets, see Using Google Cloud Storage with Big Data.

Retrieving the Google Cloud Storage URI

To load data from a Google Cloud Storage data source, you must provide the Cloud Storage URI.

The Cloud Storage URI comprises your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named myfile.csv, the bucket URI would be gs://mybucket/myfile.csv. If your data is separated into multiple files you can use a wildcard in the URI. For more information, see Cloud Storage Request URIs.

BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutives slashes into a single slash. For example, the following source URI, though valid in Cloud Storage, does not work in BigQuery: gs://[BUCKET]/my//object//name.

To retrieve the Cloud Storage URI:

  1. Open the Cloud Storage web UI.

    Cloud Storage web UI

  2. Browse to the location of the object (file) that contains the source data.

  3. At the top of the Cloud Storage web UI, note the path to the object. To compose the URI, replace gs://[BUCKET]/[FILE] with the appropriate path, for example, gs://mybucket/myfile.json. [BUCKET] is the Cloud Storage bucket name and [FILE] is the name of the object (file) containing the data.

Required permissions

When you load data into BigQuery, you need project or dataset-level permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need access to the bucket that contains your data.

BigQuery permissions

When you are loading data into BigQuery from Cloud Storage, you must be granted the bigquery.dataOwner or bigquery.dataEditor role at the project level or at the dataset level. Both roles grant users and groups permission to load data into a new table or to append to or overwrite an existing table.

Granting the roles at the project level gives the user or group permission to load data into tables in every dataset in the project. Granting the roles at the dataset level gives the user or group the ability to load data only into tables in that dataset.

For more information on configuring dataset access, see Assigning access controls to datasets. For more information on IAM roles in BigQuery, see Access Control.

Cloud Storage permissions

In order to load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions at the project level or on that individual bucket. If you are using a URI wildcard, you must also have storage.objects.list permissions.

The predefined IAM role storage.objectViewer can be granted to provide storage.objects.getand storage.objects.list permissions.

Cloud Storage access and storage logs

Google Cloud Storage provides access and storage log files in CSV format, which can be directly imported into BigQuery for analysis. For more information on loading and analyzing Cloud Storage logs, see Access Logs & Storage Logs in the Cloud Storage documentation.

Wildcard support for Cloud Storage URIs

If your Google Cloud Storage data is separated into multiple files that share a common base-name, you can use a wildcard in the URI when you load the data.

To add a wildcard to the Cloud Storage URI, you append an asterisk (*) to the base-name. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the bucket URI would be gs://mybucket/fed-sample*. This wildcard URI can then be used in the web UI, CLI, or API.

You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.

For Google Cloud Datastore backups, only one URI can be specified, and it must end with .backup_info or .export_metadata. The * wildcard character is not allowed when creating external tables linked to Cloud Datastore backups or when loading Cloud Datastore backup data from Cloud Storage.

Limitations

You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:

  • If your dataset's location is set to a value other than US, the regional or multi-regional Cloud Storage bucket must be in the same region as the dataset.

Next steps

To learn how to load data from Cloud Storage into BigQuery, see the documentation for your data format:

Cette page vous a-t-elle été utile ? Évaluez-la :

Envoyer des commentaires concernant…