Loading Data From Cloud Datastore Exports

Google BigQuery supports loading data from Cloud Datastore exports created using the Cloud Datastore managed import and export service. You can use the managed import and export service to export Cloud Datastore entities into a Cloud Storage bucket. You can then load the export into BigQuery as a table.

To learn how to create a Cloud Datastore export file, see Exporting and Importing Entities in the Cloud Datastore documentation. For information on scheduling exports, see Scheduling an Export.

You can control which properties BigQuery should load by setting the projectionFields property in the API or by using the --projection_fields flag in the CLI.
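
For illustration, a minimal sketch of the same setting with the google-cloud-bigquery Python client library; the property names passed to projection_fields are placeholders, not values from this page:

from google.cloud import bigquery

# Sketch: load only two (hypothetical) top-level entity properties.
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.DATASTORE_BACKUP
job_config.projection_fields = ["name", "author"]  # placeholder property names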

If you prefer to skip the loading process, you can query the export directly by setting it up as an external data source. For more information, see External Data Sources.

When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.

Required permissions

When you load data into BigQuery, you need project or dataset-level permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need access to the bucket that contains your data.

BigQuery permissions

When you are loading data into BigQuery from Cloud Storage, you must be granted the bigquery.dataOwner or bigquery.dataEditor role at the project level or at the dataset level. Both roles grant users and groups permission to load data into a new table or to append to or overwrite an existing table.

Granting the roles at the project level gives the user or group permission to load data into tables in every dataset in the project. Granting the roles at the dataset level gives the user or group the ability to load data only into tables in that dataset.

For more information on configuring dataset access, see Assigning access controls to datasets. For more information on IAM roles in BigQuery, see Access Control.

Cloud Storage permissions

In order to load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions at the project level or on that individual bucket. If you are using a URI wildcard, you must also have storage.objects.list permissions.

The predefined IAM role storage.objectViewer can be granted to provide both storage.objects.get and storage.objects.list permissions.
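
For reference, a sketch (not a required step) of granting that role on a bucket with the google-cloud-storage Python client library; the bucket name and member address are placeholders:

from google.cloud import storage

# Sketch: grant roles/storage.objectViewer on the bucket that holds the export.
client = storage.Client()
bucket = client.bucket("mybucket")  # placeholder bucket name
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {"user:loader@example.com"}}
)
bucket.set_iam_policy(policy)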

Limitations

When you load data into BigQuery from a Cloud Datastore export, note the following restrictions:

  • You cannot use a wildcard in the Cloud Storage URI when you specify a Cloud Datastore export file.
  • You can specify only one Cloud Storage URI when loading data from Cloud Datastore exports.
  • You cannot append Cloud Datastore export data to an existing table with a defined schema.
  • For a Cloud Datastore export to load correctly, entities in the export data must share a consistent schema.
  • Data exported without specifying an entity filter cannot be loaded into BigQuery. The export request must include one or more kind names in the entity filter (see the export sketch after this list).
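
For reference, a minimal export sketch that satisfies the entity filter requirement, assuming the Datastore Admin client shipped in the google-cloud-datastore Python package; the project ID, bucket, and kind name are placeholders:

from google.cloud import datastore_admin_v1

# Sketch: export only the "Book" kind so the result can be loaded into BigQuery.
client = datastore_admin_v1.DatastoreAdminClient()
operation = client.export_entities(
    request={
        "project_id": "my-project",            # placeholder project ID
        "output_url_prefix": "gs://mybucket",  # placeholder bucket
        "entity_filter": {"kinds": ["Book"]},  # one or more kind names
    }
)
operation.result()  # block until the export finishes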

Loading Cloud Datastore export service data

To load data from a Cloud Datastore export metadata file:

Web UI

  1. Go to the BigQuery web UI.
  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.
  3. On the Create Table page, in the Source Data section:
    • Leave Create from source selected.
    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating. The URI for your Cloud Datastore export file should end with [KIND_NAME].export_metadata. For example: default_namespace_kind_Book.export_metadata. In this example, Book is the kind name, and default_namespace_kind_Book is the file name generated by Cloud Datastore.

      Verify [KIND_NAME] is specified in your Cloud Storage URI. If you specify the URI without [KIND_NAME], you receive the following error: does not contain valid backup metadata. (error code: invalid).

    • For File format, select Cloud Datastore Backup.
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. The schema is inferred from a Cloud Datastore export.
  6. Select applicable items in the Options section and then click Create Table. For information on the available options, see Cloud Datastore options.

Command-line

Use the bq load command with source_format set to DATASTORE_BACKUP. Supply the --location flag and set the value to your location.

bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE]

where:

  • [LOCATION] is your location. The --location flag is optional if your data is in the US or the EU multi-region location. For example, if you are using BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • [FORMAT] is DATASTORE_BACKUP.
  • [DATASET] is the dataset that contains the table into which you're loading data.
  • [TABLE] is the table into which you're loading data. If the table does not exist, it is created.
  • [PATH_TO_SOURCE] is the Cloud Storage URI.

For example, the following command loads the gs://mybucket/20180228T1256/default_namespace/kind_Book/default_namespace_kind_Book.export_metadata Cloud Datastore export file into a table named book_data. mybucket and mydataset were created in the US multi-region location.

bq --location=US load --source_format=DATASTORE_BACKUP mydataset.book_data gs://mybucket/20180228T1256/default_namespace/kind_Book/default_namespace_kind_Book.export_metadata

API

Set the following properties to load Cloud Datastore export data using the API; a Python client sketch follows the steps.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. The file (object) name must end in [KIND_NAME].export_metadata. Only one URI is allowed for Cloud Datastore exports, and you cannot use a wildcard.

  4. Specify the data format by setting the configuration.load.sourceFormat property to DATASTORE_BACKUP.
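
Taken together, the steps above correspond to a load job like the following google-cloud-bigquery Python client sketch. The URI, dataset, table, and location are the sample values used elsewhere on this page:

from google.cloud import bigquery

# Sketch: load a Cloud Datastore export file into mydataset.book_data.
client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.DATASTORE_BACKUP

uri = (
    "gs://mybucket/20180228T1256/default_namespace/kind_Book/"
    "default_namespace_kind_Book.export_metadata"
)

load_job = client.load_table_from_uri(
    uri,
    "mydataset.book_data",  # created if it does not exist
    location="US",          # must match the dataset's location
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete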

Cloud Datastore Admin backups

If you use the Datastore Admin backup feature to export your Cloud Datastore data, note that the file extension will be .backup_info instead of .export_metadata. When you import your data into BigQuery, you can use either a .backup_info file or a .export_metadata file until the Datastore Admin backups service becomes unavailable.

Appending to or overwriting a table with Cloud Datastore data

When you load Cloud Datastore export data into BigQuery, you can create a new table to store the data, or you can overwrite an existing table. You cannot append Cloud Datastore export data to an existing table.

If you attempt to append Cloud Datastore export data to an existing table, the following error results: Cannot append a datastore backup to a table that already has a schema. Try using the WRITE_TRUNCATE write disposition to replace the existing table.

To overwrite an existing table with Cloud Datastore export data:

Web UI

  1. Go to the BigQuery web UI.
  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.
  3. On the Create Table page, in the Source Data section:
    • Leave Create from source selected.
    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating. The URI for your Cloud Datastore export file should end with [KIND_NAME].export_metadata. For example: default_namespace_kind_Book.export_metadata. In this example, Book is the kind name, and default_namespace_kind_Book is the file name generated by Cloud Datastore.

      Verify [KIND_NAME] is specified in your Cloud Storage URI. If you specify the URI without [KIND_NAME], you receive the following error: does not contain valid backup metadata. (error code: invalid).

    • For File format, select Cloud Datastore Backup.
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're overwriting.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. The schema is inferred from a Cloud Datastore export.
  6. In the Options section, for Write preference, choose Overwrite table.
  7. Click Create Table.

Command-line

Use the bq load command with the --replace flag and with source_format set to DATASTORE_BACKUP. Supply the --location flag and set the value to your location.

bq --location=[LOCATION] load --source_format=[FORMAT] --replace [DATASET].[TABLE] [PATH_TO_SOURCE]

where:

  • [LOCATION] is your location. The --location flag is optional if your data is in the US or the EU multi-region location. For example, if you are using BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • [FORMAT] is DATASTORE_BACKUP.
  • [DATASET] is the dataset containing the table into which you're loading data.
  • [TABLE] is the table you're overwriting.
  • [PATH_TO_SOURCE] is the Cloud Storage URI.

For example, the following command loads the gs://mybucket/20180228T1256/default_namespace/kind_Book/default_namespace_kind_Book.export_metadata Cloud Datastore export file and overwrites a table named book_data:

bq load --source_format=DATASTORE_BACKUP --replace mydataset.book_data gs://mybucket/20180228T1256/default_namespace/kind_Book/default_namespace_kind_Book.export_metadata

API

Set the following properties to overwrite a table with Cloud Datastore export data using the API; a Python client sketch follows the steps.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. The file (object) name must end in [KIND_NAME].export_metadata. Only one URI is allowed for Cloud Datastore exports, and you cannot use a wildcard.

  4. Specify the data format by setting the configuration.load.sourceFormat property to DATASTORE_BACKUP.

  5. Specify the write disposition by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE.
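
Following the same pattern as the load sketch earlier on this page, setting the write disposition is the only change needed to overwrite the table:

from google.cloud import bigquery

# Sketch: overwrite mydataset.book_data with the contents of the export file.
client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.DATASTORE_BACKUP
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

uri = (
    "gs://mybucket/20180228T1256/default_namespace/kind_Book/"
    "default_namespace_kind_Book.export_metadata"
)

load_job = client.load_table_from_uri(
    uri, "mydataset.book_data", location="US", job_config=job_config
)
load_job.result()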

Cloud Datastore options

To change how BigQuery parses Cloud Datastore export data, specify additional options in the UI, CLI, or API; a brief Python sketch follows the table below.

Option | Web UI option | CLI flag | BigQuery API property | Description
Projection fields | None | --projection_fields | projectionFields | A comma-separated list that indicates which entity properties to load into BigQuery from a Cloud Datastore export. Property names are case sensitive and must be top-level properties. If no properties are specified, BigQuery loads all properties. If any named property isn't found in the Cloud Datastore export, an invalid error is returned in the job result. The default value is ''.
Number of bad records allowed | Number of errors allowed | --max_bad_records | maxBadRecords | (Optional) The maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an invalid error is returned in the job result. The default value is 0, which requires that all records are valid.
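
As a brief sketch, the maxBadRecords option maps onto the Python client's load configuration in the same way as projection_fields shown earlier; the limit value below is only an example:

from google.cloud import bigquery

# Sketch: allow a limited number of bad records when loading the export.
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.DATASTORE_BACKUP
job_config.max_bad_records = 10  # example limit; the default is 0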

Data type conversion

BigQuery converts data from each entity in Cloud Datastore export files to BigQuery's data types. The following table describes the conversion between data types.

Cloud Datastore data type | BigQuery data type
Blob | BigQuery discards these values when loading the data.
Blobstore key | STRING
Boolean | BOOLEAN
Category | STRING
Datastore key | RECORD
Date and time | TIMESTAMP
Email | STRING
Embedded entity | RECORD
Floating-point number | FLOAT
Geographical point | RECORD [{"lat","DOUBLE"}, {"long","DOUBLE"}]
IM handle | STRING
Integer | INTEGER
Link | STRING
Phone number | STRING
Postal address | STRING
Rating | INTEGER
Short blob | BigQuery discards these values when loading the data.
String | STRING (truncated to 64 KB)
User | RECORD [{"email","STRING"}, {"userid","STRING"}]

Datastore key properties

Each entity in Cloud Datastore has a unique key that contains information such as the namespace and the path. BigQuery creates a RECORD data type for the key, with nested fields for each piece of information, as described in the following table. A short query sketch follows the table.

Key property | Description | BigQuery data type
__key__.app | The Cloud Datastore app name. | STRING
__key__.id | The entity's ID, or null if __key__.name is set. | INTEGER
__key__.kind | The entity's kind. | STRING
__key__.name | The entity's name, or null if __key__.id is set. | STRING
__key__.namespace | If the Cloud Datastore app uses a custom namespace, the entity's namespace. Else, the default namespace is represented by an empty string. | STRING
__key__.path | The flattened ancestral path of the entity, consisting of the sequence of kind-identifier pairs from the root entity to the entity itself. For example: "Country", "USA", "PostalCode", 10011, "Route", 1234. | STRING
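
For illustration, the nested key fields can be queried like any other RECORD fields. A minimal sketch against the sample book_data table used earlier on this page:

from google.cloud import bigquery

# Sketch: read the nested __key__ fields from the table loaded from the export.
client = bigquery.Client()
query = """
    SELECT __key__.kind AS kind, __key__.name AS name, __key__.path AS path
    FROM `mydataset.book_data`
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.kind, row.name, row.path)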
