Using Datasets

This document describes how to create, update, and manage datasets in BigQuery.

Overview

A dataset is contained within a specific project. Datasets enable you to organize and control access to your tables and views. A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.

Dataset limitations

BigQuery datasets are subject to the following limitations:

  • You can set the geographic location at creation time only. After a dataset is created, its location is immutable and can't be changed with the patch or update API methods.
  • All tables referenced in a query must be stored in datasets in the same location.
  • You can stream data into datasets located in the US or the EU, but streaming data into a dataset in a remote location can increase latency and error rates.
  • When you copy a table, the source and destination datasets must reside in the same location.
  • If you export data from Google Analytics to BigQuery, the destination dataset must be in the US.
  • If a dataset has 50,000 or more tables, enumerating them becomes slower, whether through an API call, the web UI, or a query against __TABLES_SUMMARY__.
  • Currently, no limit exists on the number of datasets per project. However, as you approach thousands of datasets in a project, web UI performance begins to degrade.

Creating datasets

When you create a dataset in BigQuery, the dataset name must be unique per project. The dataset name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

Dataset names cannot:

  • Begin with a number or an underscore
  • Contain spaces or special characters (such as &, @, or %)
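
Taken together, the naming rules above amount to a simple pattern. As an illustration, here is a minimal validity check (a sketch; `is_valid_dataset_name` is our name, not part of any BigQuery tool or library):

```python
import re

# Letters, digits, and underscores only; must start with a letter;
# at most 1,024 characters in total.
DATASET_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,1023}$")

def is_valid_dataset_name(name):
    """Return True if name satisfies the dataset naming rules above."""
    return bool(DATASET_NAME_RE.match(name))
```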

Creating a dataset requires bigquery.datasets.create permissions. Several predefined IAM roles include bigquery.datasets.create permissions.

For more information on IAM roles and permissions in BigQuery, see Access control.

To create a dataset:

Web UI

To create a dataset using the web UI:

  1. Click the down arrow icon next to your project name in the navigation pane and click Create new dataset.

  2. In the Create Dataset dialog:

    • For Dataset ID, enter a unique dataset name.
    • For Data location, choose one of the following options:

      • Unspecified (default): You have no preference on data location.
      • US: BigQuery attempts to store your data in the US.
      • EU: Your Core BigQuery Customer Data resides in the EU.

        Core BigQuery Customer Data is defined in the Service Specific Terms.

    • For Data expiration, choose one of the following options:

      • Never: Tables created in the dataset are never automatically deleted. You must delete them manually.
      • In [INTEGER] days: Any table created in the dataset is deleted after [INTEGER] days.

    • Click OK.

Command-line

Use the bq mk command to create a new dataset. Optional parameters include --data_location, --default_table_expiration, and --description:

bq mk --data_location [LOCALE] --default_table_expiration [INTEGER] --description [DESCRIPTION] [DATASET]

Where:

  • [DATASET] is the name of the dataset you're creating.
  • [LOCALE] is the data location, either US or EU. If omitted, the default value is "Unspecified".
  • [INTEGER] is the default lifetime (in seconds) for newly-created tables. The minimum value is 3600 seconds (one hour). The expiration time evaluates to the current time plus the integer value.
  • [DESCRIPTION] is a description of the dataset in quotes.

For example, the following command creates a dataset named mydataset with the data location set to US, a default table expiration of 3600 seconds (1 hour), and the description "This is my dataset."

bq mk --data_location US --default_table_expiration 3600 --description "This is my dataset." mydataset

API

Call datasets.insert and pass any relevant parameters.
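
As a sketch of what datasets.insert expects, the request body can be assembled as follows. The helper name is ours; the field names (datasetReference, location, description, defaultTableExpirationMs) follow the BigQuery v2 REST API, which takes the expiration as a string of milliseconds:

```python
def make_dataset_body(project_id, dataset_id, location=None,
                      description=None, default_table_expiration_ms=None):
    """Build a datasets.insert request body (a sketch)."""
    body = {
        "datasetReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
        }
    }
    if location is not None:
        body["location"] = location
    if description is not None:
        body["description"] = description
    if default_table_expiration_ms is not None:
        # The API represents this int64 field as a string.
        body["defaultTableExpirationMs"] = str(default_table_expiration_ms)
    return body
```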

When you create a dataset, you can also apply labels to it. For more information on dataset labels, see Using Labels.

Assigning access controls to datasets

You share access to BigQuery tables and views using project-level IAM roles and dataset-level access controls. Currently, you cannot apply access controls directly to tables or views.

Project-level access controls determine the users and groups allowed to access all datasets, tables, and views within a project. Dataset access controls determine the users (including service accounts) and groups allowed to access the tables and views in a specific dataset.

For example, if you assign the bigquery.dataOwner role to a user at the project level, that user can create, update, get, and delete tables and views in all of the project's datasets. If you assign the bigquery.dataOwner role at the dataset level, the user can create, update, get, and delete tables and views only in that dataset.

If you assign users or groups to a more restrictive role at the project level, you must also grant access to individual datasets. For example, if you grant a user or group the bigquery.user role at the project level, by default, no access is granted to any of the datasets, tables, or views in the project. bigquery.user gives users the ability to create their own datasets and to run query jobs against datasets they have been given access to. If you assign the bigquery.user role, you must also assign access controls to each dataset the user or group needs to access that wasn't created by the user.

For more information about project-level IAM roles and dataset access controls, see access control.

You can apply access controls to a dataset after the dataset is created using the command-line tool and the web UI. You can apply access controls during or after dataset creation using the API.

To assign access controls to a dataset:

Web UI

  1. Click the drop-down arrow to the right of the dataset and choose Share Dataset.

  2. In the Share Dataset dialog, for Add People, click the drop-down to the left of the field, and choose the appropriate option:

    • User by e-mail - Gives an individual Google account access to the dataset
    • Group by e-mail - Gives all members of a Google group access to the dataset
    • Domain - Gives all users and groups in a Google domain access to the dataset
    • All Authenticated Users - Gives all Google account holders access to the dataset (makes the dataset public)
    • Project Owners - Gives all project owners access to the dataset
    • Project Viewers - Gives all project viewers access to the dataset
    • Project Editors - Gives all project editors access to the dataset
    • Authorized View - Gives a view access to the dataset

  3. Type a value in the text box. For example, if you chose User by e-mail, type the user's email address.

  4. To the right of the Add People field, click Can view and choose the appropriate role from the list.

  5. Click Add and then click Save changes.

Command-line

  1. Export the existing access controls to a JSON file using the show command.

    bq --format=json show [DATASET] >[DATASET].json
    

    Where:

    • [DATASET] is the name of your dataset.
  2. Make your changes to the "access" section of the JSON file. You can add or remove any of the special groups: Project Owners, Project Writers, Project Readers, and All Authenticated Users. You can also add any of the following: User by e-mail, Group by e-mail, and Domain.

    For example:

    {
     "access": [
      {
       "role": "READER",
       "specialGroup": "projectReaders"
      },
      {
       "role": "WRITER",
       "specialGroup": "projectWriters"
      },
      {
       "role": "OWNER",
       "specialGroup": "projectOwners"
      },
      {
       "role": "READER",
       "specialGroup": "allAuthenticatedUsers"
      },
      {
       "role": "READER",
       "domain": "[DOMAIN_NAME]"
      },
      {
       "role": "WRITER",
       "userByEmail": "[USER_EMAIL]"
      },
      {
       "role": "READER",
       "groupByEmail": "[GROUP_EMAIL]"
      }
     ]
    }
    

  3. When your edits are complete, use the update command and include the JSON file using the --source flag.

    bq update --source=[DATASET].json [DATASET]

    Where:

    • [DATASET] is the name of your dataset.
  4. To verify your access control changes, use the show command.

    bq show [DATASET]

    Where:

    • [DATASET] is the name of your dataset.

API

Call datasets.insert or datasets.update and use the access[] property to apply your access controls. For more information, see Datasets.
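
For illustration, access[] entries like those in the command-line example can be built up as follows (a sketch; the helper name is ours, the keys follow the BigQuery REST API):

```python
def access_entry(role, **member):
    """Build one access[] entry, e.g. access_entry("READER", domain="example.com").

    member must be exactly one of: userByEmail, groupByEmail, domain,
    specialGroup, or view.
    """
    if len(member) != 1:
        raise ValueError("specify exactly one member key")
    entry = {"role": role}
    entry.update(member)
    return entry

access = [
    access_entry("OWNER", specialGroup="projectOwners"),
    access_entry("WRITER", userByEmail="analyst@example.com"),
    access_entry("READER", groupByEmail="team@example.com"),
]
```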

Getting information about datasets

You can get information about datasets to which you have access using the web UI, the CLI, and the API.

Getting information about datasets requires bigquery.datasets.get permissions. All BigQuery IAM roles include bigquery.datasets.get permissions except for bigquery.jobUser.

For more information on IAM roles and permissions in BigQuery, see Access control.

To get information about datasets:

Web UI

Click the dataset name. The Dataset Details page displays the dataset's description, details, and tables.

CLI

Issue the bq show command:

bq show --format=prettyjson [DATASET]

Where:

  • [DATASET] is the name of the dataset.

For example, the following command displays information about mydataset.

bq show --format=prettyjson mydataset

API

Call the bigquery.datasets.get API method and provide any relevant parameters.

Listing datasets

You can list datasets to which you have access using the web UI, the CLI (the bq ls command), or the bigquery.datasets.list API method.

Listing datasets requires bigquery.datasets.list permissions. All BigQuery IAM roles include bigquery.datasets.list permissions except for bigquery.jobUser.

For more information on IAM roles and permissions in BigQuery, see Access control.

To list the datasets in a project:

Web UI

Datasets are listed by project in the web UI's navigation pane.

Command-line

Issue the bq ls command to list your datasets by dataset ID.

API

To list datasets using the API, call the bigquery.datasets.list API method.

C#

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

public List<BigQueryDataset> ListDatasets(BigQueryClient client)
{
    var datasets = client.ListDatasets().ToList();
    return datasets;
}

Go

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

it := client.Datasets(ctx)
for {
	dataset, err := it.Next()
	if err == iterator.Done {
		break
	}
	fmt.Println(dataset.DatasetID)
}

Java

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

// List datasets in the client's default project.
Page<Dataset> datasets = bigquery.listDatasets(DatasetListOption.pageSize(100));
for (Dataset dataset : datasets.iterateAll()) {
  // do something with the dataset
}

// List datasets in a specified project.
Page<Dataset> datasets = bigquery.listDatasets(projectId, DatasetListOption.pageSize(100));
for (Dataset dataset : datasets.iterateAll()) {
  // do something with the dataset
}

Node.js

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

// Imports the Google Cloud client library
const BigQuery = require('@google-cloud/bigquery');

// The project ID to use, e.g. "your-project-id"
// const projectId = "your-project-id";

// Instantiates a client
const bigquery = BigQuery({
  projectId: projectId
});

// Lists all datasets in the specified project
bigquery.getDatasets()
  .then((results) => {
    const datasets = results[0];
    console.log('Datasets:');
    datasets.forEach((dataset) => console.log(dataset.id));
  })
  .catch((err) => {
    console.error('ERROR:', err);
  });

PHP

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

use Google\Cloud\BigQuery\BigQueryClient;

/**
 * @param string $projectId The Google project ID.
 */
function list_datasets($projectId)
{
    $bigQuery = new BigQueryClient([
        'projectId' => $projectId,
    ]);
    $datasets = $bigQuery->datasets();
    foreach ($datasets as $dataset) {
        print($dataset->id() . PHP_EOL);
    }
}

Python

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

def list_datasets(project=None):
    """Lists all datasets in a given project.

    If no project is specified, then the currently active project is used.
    """
    bigquery_client = bigquery.Client(project=project)

    for dataset in bigquery_client.list_datasets():
        print(dataset.dataset_id)

Ruby

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

# project_id = "Your Google Cloud project ID"

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: project_id

bigquery.datasets.each do |dataset|
  puts dataset.dataset_id
end

Using meta-tables

BigQuery offers some special tables whose contents represent metadata, such as the names of your tables. The "meta-tables" are read-only. Normally, you use one by referencing it in a SELECT statement.

Meta-tables can be used in other API operations besides a query job, such as tables.get or tabledata.list. Meta-tables do not support tables.insert and cannot be used as a destination table. Meta-tables also do not support table decorators in legacy SQL, and they do not appear when you run tables.list on a dataset.

The primary meta-table is __TABLES_SUMMARY__. Generally, __TABLES_SUMMARY__ is reasonably fast for datasets with up to a few thousand tables. For larger datasets, __TABLES_SUMMARY__ becomes increasingly slow and may exceed available resources.

Metadata about tables in a dataset

You can access metadata about the tables in a dataset by using the __TABLES_SUMMARY__ meta-table.

Use the following syntax to query a meta-table:

    SELECT [FIELD] FROM [DATASET].__TABLES_SUMMARY__;

Where [DATASET] is the name of your dataset and [FIELD] is one of the following:

  • project_id - Name of the project.
  • dataset_id - Name of the dataset.
  • table_id - Name of the table.
  • creation_time - The time at which the table was created, in milliseconds since January 1, 1970 UTC.
  • type - An integer representing the table type: a regular table (1) or a view (2).

Example

The following query retrieves metadata about the tables in the publicdata:samples dataset.

    SELECT * FROM publicdata:samples.__TABLES_SUMMARY__;

Returns:

+------------+------------+-----------------+---------------+------+
| project_id | dataset_id |    table_id     | creation_time | type |
+------------+------------+-----------------+---------------+------+
| publicdata | samples    | github_nested   | 1348782587310 |    1 |
| publicdata | samples    | github_timeline | 1335915950690 |    1 |
| publicdata | samples    | gsod            | 1335916040125 |    1 |
| publicdata | samples    | natality        | 1335916045005 |    1 |
| publicdata | samples    | shakespeare     | 1335916045099 |    1 |
| publicdata | samples    | trigrams        | 1335916127449 |    1 |
| publicdata | samples    | wikipedia       | 1335916132870 |    1 |
+------------+------------+-----------------+---------------+------+

Updating datasets

You can modify the following dataset properties by using the web UI, CLI, or API:

  • The default expiration time for new tables
  • The dataset description
  • The dataset labels

Updating datasets requires bigquery.datasets.update permissions. Several predefined IAM roles include bigquery.datasets.update permissions.

For more information on IAM roles and permissions in BigQuery, see Access control.

Updating dataset descriptions

To update the dataset description:

Web UI

You cannot add a description when you create a dataset using the web UI. After the dataset is created, you can add a description on the Dataset Details page.

  1. In the navigation pane, select your dataset.

  2. On the Dataset Details page, in the Description section, click Describe this dataset to open the description box.

  3. Enter a description in the box. When you click away from the box, the text is saved.

CLI

Issue the bq update command with the --description flag:

bq update --description "[DESCRIPTION]" [DATASET]

Where:

  • [DESCRIPTION] is the text that describes the dataset in quotes.
  • [DATASET] is the name of the dataset you're updating.

For example, the following command changes the description of mydataset to "Description of mydataset."

bq update --description "Description of mydataset" mydataset

API

Call datasets.update and use the description property to apply your dataset description.

Updating default table expiration times

To update the default expiration time for newly created tables in a dataset:

Web UI

To update the default expiration time using the web UI:

  1. In the navigation pane, select your dataset.

  2. On the Dataset Details page, in the Details section, to the right of Default Table Expiration, click Edit.

  3. In the Update Expiration dialog, for Data expiration, click In and enter the expiration time in days. The default value is Never.

CLI

To update the default lifetime for newly created tables in a dataset, enter the bq update command with the --default_table_expiration flag:

bq update --default_table_expiration [INTEGER] [DATASET]

Where:

  • [INTEGER] is the default lifetime (in seconds) for newly-created tables. The minimum value is 3600 seconds (one hour). The expiration time evaluates to the current time plus the integer value. Specify 0 to remove the existing expiration time.
  • [DATASET] is the name of the dataset you're updating.

For example, the following command sets the default table expiration for tables created in mydataset to two hours from the current time:

bq update --default_table_expiration 7200 mydataset

API

Call datasets.update and use the defaultTableExpirationMs property to apply your expiration time in milliseconds.
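
Note the unit mismatch: the CLI flag takes seconds while defaultTableExpirationMs takes milliseconds. A small conversion helper (ours, as a sketch) mirrors the CLI conventions described above, including 0 to remove the default:

```python
def default_table_expiration_to_ms(seconds):
    """Convert a CLI-style expiration in seconds to the API's milliseconds.

    Returns None for 0, mirroring the CLI convention of removing the
    existing default expiration.
    """
    if seconds == 0:
        return None
    if seconds < 3600:
        raise ValueError("minimum expiration is 3600 seconds (one hour)")
    return seconds * 1000
```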

Renaming datasets

Currently, you cannot change the name of an existing dataset. If you need to change the dataset name, follow these steps:

  1. Create a new dataset.

  2. Copy the tables from the old dataset to the new one.

  3. Save copies of your views to the new dataset.

  4. Delete the old dataset.

Deleting datasets

When you delete a dataset using the web UI, tables in the dataset (and the data they contain) are deleted. When you delete a dataset using the CLI, you must use the -r flag to delete the dataset's tables.

Deleting a dataset requires bigquery.datasets.delete permissions. If the dataset contains tables, bigquery.tables.delete is also required. Several predefined IAM roles include both bigquery.datasets.delete and bigquery.tables.delete permissions.

For more information on IAM roles and permissions in BigQuery, see Access control.

To delete a dataset:

Web UI

To delete a dataset using the web UI:

  1. Click the down arrow icon next to your dataset name in the navigation pane, and then click Delete dataset.

  2. In the Delete Dataset dialog:

    • For Dataset ID, enter the name of the dataset to delete.
    • Click OK.

Command-line

Use the bq rm command with the -d flag to delete a dataset. When you use the CLI to remove a dataset, you must confirm the command. You can use the -f flag to skip confirmation. In addition, if the dataset contains tables, you must use the -r flag to remove all tables in the dataset:

bq rm -r -f -d [DATASET]

Where:

  • [DATASET] is the name of the dataset you're deleting.

For example, the following command removes mydataset and all the tables in it:

bq rm -r -d mydataset

API

Call datasets.delete to delete the dataset and pass the deleteContents parameter to delete the tables in it.
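
As a sketch, the underlying REST request is an HTTP DELETE against the dataset URL, with deleteContents passed as a query parameter (the helper name is ours; the endpoint follows the BigQuery v2 API):

```python
def dataset_delete_url(project_id, dataset_id, delete_contents=False):
    """Build the datasets.delete request URL (sent as an HTTP DELETE)."""
    url = ("https://www.googleapis.com/bigquery/v2/projects/{}/datasets/{}"
           .format(project_id, dataset_id))
    if delete_contents:
        # Required when the dataset still contains tables.
        url += "?deleteContents=true"
    return url
```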
