Creating datasets

This document describes how to create datasets in BigQuery.

You can create datasets in the following ways:

  • Using the Google Cloud console.
  • Using a SQL query.
  • Using the bq mk command in the bq command-line tool.
  • Calling the datasets.insert API method.
  • Using the client libraries.
  • Copying an existing dataset.

To see steps for copying a dataset, including across regions, see Copying datasets.

To learn to query tables in a public dataset, see Query a public dataset with the Google Cloud console.

Dataset limitations

BigQuery datasets are subject to the following limitations:

  • You can set the geographic location at creation time only. After a dataset has been created, the location can't be changed by using the console, using the bq command-line tool, or calling the patch or update API methods.
  • All tables that are referenced in a query must be stored in datasets in the same location.

  • When you copy a table, the datasets that contain the source table and destination table must reside in the same location.

  • Dataset names must be unique for each project.

Before you begin

Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document.

Required permissions

To create a dataset, you need the bigquery.datasets.create IAM permission.

Each of the following predefined IAM roles includes the permissions that you need to create a dataset:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.user
  • roles/bigquery.admin

For more information about IAM roles in BigQuery, see Predefined roles and permissions.
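
For example, if you manage access with the gcloud CLI, a project-level grant looks like the following sketch. PROJECT_ID and USER_EMAIL are placeholders for your project and for the user who needs to create datasets:

# Grant the BigQuery User role at the project level.
# PROJECT_ID and USER_EMAIL are placeholders.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/bigquery.user"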

Name datasets

When you create a dataset in BigQuery, the dataset name must be unique for each project. The dataset name can contain the following:

  • Up to 1,024 characters.
  • Letters (uppercase or lowercase), numbers, and underscores.

Dataset names are case-sensitive: mydataset and MyDataset can coexist in the same project.

Dataset names cannot contain spaces or special characters such as -, &, @, or %.

Create datasets

To create a dataset:

Console

  1. Open the BigQuery page in the console.

    Go to the BigQuery page

  2. In the Explorer panel, select the project where you want to create the dataset.

  3. Expand the Actions option and click Create dataset.

  4. On the Create dataset page:

    • For Dataset ID, enter a unique dataset name.
    • For Data location, choose a geographic location for the dataset. After a dataset is created, the location can't be changed.

    • For Default table expiration, choose one of the following options:

      • Never: (Default) Tables created in the dataset are never automatically deleted. You must delete them manually.
      • Number of days after table creation: This value determines when a newly created table in the dataset is deleted. It is applied only if you do not set a table expiration when you create the table.

    • Click Create dataset.

SQL

Use the CREATE SCHEMA statement.

To create a dataset in a project other than your default project, add the project ID to the dataset ID in the following format: PROJECT_ID.DATASET_ID.

  1. In the console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE SCHEMA PROJECT_ID.DATASET_ID
      OPTIONS (
        default_kms_key_name = 'KMS_KEY_NAME',
        default_partition_expiration_days = PARTITION_EXPIRATION,
        default_table_expiration_days = TABLE_EXPIRATION,
        description = 'DESCRIPTION',
        labels = [('LABEL_1','VALUE_1'),('LABEL_2','VALUE_2')],
        max_time_travel_hours = HOURS);
    

    Replace the following:

    • PROJECT_ID: your project ID
    • DATASET_ID: the ID of the dataset that you're creating
    • KMS_KEY_NAME: the name of the default Cloud Key Management Service key used to protect newly created tables in this dataset unless a different key is supplied at the time of creation. You cannot create a Google-encrypted table in a dataset with this parameter set.
    • PARTITION_EXPIRATION: the default lifetime (in days) for partitions in newly created partitioned tables. The default partition expiration has no minimum value. The expiration time evaluates to the partition's date plus the specified value. Any partition created in a partitioned table in the dataset is deleted PARTITION_EXPIRATION days after the partition's date. If you supply the partition_expiration_days option when you create or update a partitioned table, the table-level partition expiration takes precedence over the dataset-level default partition expiration.
    • TABLE_EXPIRATION: the default lifetime (in days) for newly created tables. The minimum value is 0.042 days (one hour). The expiration time evaluates to the current time plus the specified value. Any table created in the dataset is deleted TABLE_EXPIRATION days after its creation time. This value is applied if you do not set a table expiration when you create the table.
    • DESCRIPTION: a description of the dataset
    • LABEL_1:VALUE_1: the key-value pair that you want to set as the first label on this dataset
    • LABEL_2:VALUE_2: the key-value pair that you want to set as the second label
    • HOURS: the duration in hours of the time travel window for the new dataset. The ability to configure the time travel window is in preview. The HOURS value must be between 48 and 168. If not specified, this defaults to 168.

  3. Click Run.

For more information about how to run queries, see Running interactive queries.
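
For example, the following statement creates a dataset named mydataset in your default project, with a one-day default table expiration and a description; the option values are illustrative:

-- Illustrative values; set only the options that you need.
CREATE SCHEMA mydataset
  OPTIONS (
    default_table_expiration_days = 1,
    description = 'My example dataset');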

bq

To create a new dataset, use the bq mk command with the --location flag.

To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_ID.

bq --location=LOCATION mk \
    --dataset \
    --default_kms_key=KMS_KEY_NAME \
    --default_partition_expiration=PARTITION_EXPIRATION \
    --default_table_expiration=TABLE_EXPIRATION \
    --description="DESCRIPTION" \
    --label=LABEL_1:VALUE_1 \
    --label=LABEL_2:VALUE_2 \
    --max_time_travel_hours=HOURS \
    PROJECT_ID:DATASET_ID

Replace the following:

  • LOCATION: the dataset's location. After a dataset is created, the location can't be changed. You can set a default value for the location by using the .bigqueryrc file.

  • KMS_KEY_NAME: the name of the default Cloud Key Management Service key used to protect newly created tables in this dataset unless a different key is supplied at the time of creation. You cannot create a Google-encrypted table in a dataset with this parameter set.

  • PARTITION_EXPIRATION: the default lifetime (in seconds) for partitions in newly created partitioned tables. The default partition expiration has no minimum value. The expiration time evaluates to the partition's date plus the integer value. Any partition created in a partitioned table in the dataset is deleted PARTITION_EXPIRATION seconds after the partition's date. If you supply the --time_partitioning_expiration flag when you create or update a partitioned table, the table-level partition expiration takes precedence over the dataset-level default partition expiration.

  • TABLE_EXPIRATION: the default lifetime (in seconds) for newly created tables. The minimum value is 3600 seconds (one hour). The expiration time evaluates to the current time plus the integer value. Any table created in the dataset is deleted TABLE_EXPIRATION seconds after its creation time. This value is applied if you do not set a table expiration when you create the table.

  • DESCRIPTION: a description of the dataset

  • LABEL_1:VALUE_1: the key-value pair that you want to set as the first label on this dataset

  • LABEL_2:VALUE_2: the key-value pair that you want to set as the second label

  • HOURS: the duration in hours of the time travel window for the new dataset. The ability to configure the time travel window is in preview. The HOURS field must be a value between 48 and 168. If not specified, this defaults to 168.

  • PROJECT_ID: your project ID

  • DATASET_ID: the ID of the dataset that you're creating

For example, the following command creates a dataset named mydataset with data location set to US, a default table expiration of 3600 seconds (1 hour), and a description of This is my dataset. Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.

bq --location=US mk -d \
    --default_table_expiration 3600 \
    --description "This is my dataset." \
    mydataset

To confirm that the dataset was created, enter the bq ls command. You can also create a table when you create a new dataset by using the following format: bq mk -t dataset.table. For more information about creating tables, see Creating a table.
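
For example, the following commands list the datasets in your default project and then create an empty table named mytable in mydataset. The table name is a placeholder, and bq mk -t creates the table without a schema unless you supply one:

# List datasets in the default project, then add a schemaless table to mydataset.
bq ls
bq mk -t mydataset.mytable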

API

Call the datasets.insert method with a defined dataset resource.
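
As a minimal sketch, you can also call the endpoint directly with curl, assuming that you have authenticated with the gcloud CLI. PROJECT_ID and DATASET_ID are placeholders, and datasetReference is the only required field in the request body:

# Minimal sketch: create a dataset through the REST API.
# PROJECT_ID and DATASET_ID are placeholders.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets" \
    -d '{"datasetReference": {"projectId": "PROJECT_ID", "datasetId": "DATASET_ID"}, "location": "US"}'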

C#

Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery C# API reference documentation.


using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;

public class BigQueryCreateDataset
{
    public BigQueryDataset CreateDataset(
        string projectId = "your-project-id",
        string location = "US"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var dataset = new Dataset
        {
            // Specify the geographic location where the dataset should reside.
            Location = location
        };
        // Create the dataset
        return client.CreateDataset(
            datasetId: "your_new_dataset_id", dataset);
    }
}

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// createDataset demonstrates creation of a new dataset using an explicit destination location.
func createDataset(projectID, datasetID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	meta := &bigquery.DatasetMetadata{
		Location: "US", // See https://cloud.google.com/bigquery/docs/locations
	}
	if err := client.Dataset(datasetID).Create(ctx, meta); err != nil {
		return err
	}
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetInfo;

public class CreateDataset {

  public static void runCreateDataset() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    createDataset(datasetName);
  }

  public static void createDataset(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      DatasetInfo datasetInfo = DatasetInfo.newBuilder(datasetName).build();

      Dataset newDataset = bigquery.create(datasetInfo);
      String newDatasetName = newDataset.getDatasetId().getDataset();
      System.out.println(newDatasetName + " created successfully");
    } catch (BigQueryException e) {
      System.out.println("Dataset was not created. \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

// Import the Google Cloud client library and create a client
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function createDataset() {
  // Creates a new dataset named "my_dataset".

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_new_dataset";

  // Specify the geographic location where the dataset should reside
  const options = {
    location: 'US',
  };

  // Create a new dataset
  const [dataset] = await bigquery.createDataset(datasetId, options);
  console.log(`Dataset ${dataset.id} created.`);
}
createDataset();

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation.

use Google\Cloud\BigQuery\BigQueryClient;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';

$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->createDataset($datasetId);
printf('Created dataset %s' . PHP_EOL, $datasetId);

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set dataset_id to the ID of the dataset to create.
# dataset_id = "{}.your_dataset".format(client.project)

# Construct a full Dataset object to send to the API.
dataset = bigquery.Dataset(dataset_id)

# TODO(developer): Specify the geographic location where the dataset should reside.
dataset.location = "US"

# Send the dataset to the API for creation, with an explicit timeout.
# Raises google.api_core.exceptions.Conflict if the Dataset already
# exists within the project.
dataset = client.create_dataset(dataset, timeout=30)  # Make an API request.
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Ruby API reference documentation.

require "google/cloud/bigquery"

def create_dataset dataset_id = "my_dataset", location = "US"
  bigquery = Google::Cloud::Bigquery.new

  # Create the dataset in a specified geographic location
  bigquery.create_dataset dataset_id, location: location

  puts "Created dataset: #{dataset_id}"
end

Dataset security

To control access to datasets in BigQuery, see Controlling access to datasets. For information about data encryption, see Encryption at rest.

Next steps

Try it for yourself

If you're new to Google Cloud, create an account to evaluate how BigQuery performs in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Try BigQuery free