Batch loading data

You can load data into BigQuery from Cloud Storage or from a local file as a batch operation. The source data can be in any of the following formats:

  • Avro
  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • ORC
  • Parquet
  • Firestore exports stored in Cloud Storage.

You can also use BigQuery Data Transfer Service to set up recurring loads from Cloud Storage into BigQuery.

Required permissions

When you load data into BigQuery, you need permissions to run a load job and permissions that let you load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need permissions to access the bucket that contains your data.

BigQuery permissions

At a minimum, the following permissions are required to load data into BigQuery. These permissions are required if you are loading data into a new table or partition, or if you are appending or overwriting a table or partition.

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.jobs.create

The following predefined IAM roles include both bigquery.tables.create and bigquery.tables.updateData permissions:

  • bigquery.dataEditor
  • bigquery.dataOwner
  • bigquery.admin

The following predefined IAM roles include bigquery.jobs.create permissions:

  • bigquery.user
  • bigquery.jobUser
  • bigquery.admin

In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access lets the user create and update tables in the dataset by using a load job.

For more information on IAM roles and permissions in BigQuery, see Access control.
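
For illustration, one way to grant the table-level permissions with the Python client library is to add a dataset-level access entry: the dataset WRITER role includes the bigquery.tables.create and bigquery.tables.updateData permissions for tables in that dataset. The user still needs the bigquery.jobs.create permission (for example, through the bigquery.jobUser role) to run the load job. This is only a sketch; the dataset ID and email address are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()
    dataset = client.get_dataset("mydataset")  # Placeholder dataset ID.

    # Grant the dataset-level WRITER role to a user.
    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(
            role="WRITER",
            entity_type="userByEmail",
            entity_id="data-loader@example.com",  # Placeholder principal.
        )
    )
    dataset.access_entries = entries
    dataset = client.update_dataset(dataset, ["access_entries"])  # API request.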

Cloud Storage permissions

To load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions. If you are using a URI wildcard, you must also have storage.objects.list permissions.

Granting the predefined IAM role storage.objectViewer provides both the storage.objects.get and storage.objects.list permissions.
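
For illustration, the following sketch adds storage.objectViewer to a bucket's IAM policy by using the Cloud Storage Python client library; the bucket name and principal are placeholders, and you can grant the same role in the Cloud Console or with gsutil instead.

    from google.cloud import storage

    storage_client = storage.Client()
    bucket = storage_client.bucket("mybucket")  # Placeholder bucket name.

    # Add roles/storage.objectViewer to the bucket's IAM policy.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(
        {
            "role": "roles/storage.objectViewer",
            "members": {"user:data-loader@example.com"},  # Placeholder principal.
        }
    )
    bucket.set_iam_policy(policy)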

Loading data from Cloud Storage

BigQuery supports loading data from any of the following Cloud Storage storage classes:

  • Standard
  • Nearline
  • Coldline
  • Archive

To learn how to load data into BigQuery, see the page for your data format:

To learn how to configure a recurring load from Cloud Storage into BigQuery, see Cloud Storage transfers.
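
As a quick orientation, the following minimal Python sketch runs a load job from a Cloud Storage URI, using the same client library as the local-file Python sample later on this page. The table ID, bucket name, and object name are placeholders, and a CSV source with schema auto-detection is assumed.

    from google.cloud import bigquery

    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    # table_id = "your-project.your_dataset.your_table_name"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    uri = "gs://mybucket/mydata.csv"  # Placeholder Cloud Storage URI.

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # Waits for the job to complete.

    table = client.get_table(table_id)
    print("Loaded {} rows to {}".format(table.num_rows, table_id))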

Location considerations

When you choose a location for your data, consider the following:

  • Colocate your Cloud Storage buckets for loading data.
    • If your BigQuery dataset is in a multi-regional location, the Cloud Storage bucket containing the data you're loading must be a regional or multi-regional bucket in the same location. For example, if your BigQuery dataset is in the EU, the Cloud Storage bucket must be a regional or multi-regional bucket in the EU.
    • If your dataset is in a regional location, your Cloud Storage bucket must be a regional bucket in the same location. For example, if your dataset is in the Tokyo region, your Cloud Storage bucket must be a regional bucket in Tokyo.
    • Exception: If your dataset is in the US multi-regional location, you can load data from a Cloud Storage bucket in any regional or multi-regional location.
  • Develop a data management plan.

For more information on Cloud Storage locations, see Bucket locations in the Cloud Storage documentation.
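
If you aren't sure whether your dataset and your Cloud Storage bucket are colocated, you can check both locations with the client libraries. The following minimal Python sketch uses placeholder dataset and bucket names.

    from google.cloud import bigquery
    from google.cloud import storage

    bq_client = bigquery.Client()
    storage_client = storage.Client()

    dataset = bq_client.get_dataset("mydataset")    # Placeholder dataset ID.
    bucket = storage_client.get_bucket("mybucket")  # Placeholder bucket name.

    print("Dataset location: {}".format(dataset.location))  # For example, "US".
    print("Bucket location: {}".format(bucket.location))    # For example, "US".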

You cannot change the location of a dataset after it is created, but you can make a copy of the dataset or manually move it. For more information, see:

Retrieving the Cloud Storage URI

To load data from a Cloud Storage data source, you must provide the Cloud Storage URI.

The Cloud Storage URI comprises your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named myfile.csv, the URI would be gs://mybucket/myfile.csv. If your data is separated into multiple files, you can use a wildcard in the URI. For more information, see Cloud Storage Request URIs.

BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutive slashes into a single slash. For example, the following source URI, though valid in Cloud Storage, does not work in BigQuery: gs://bucket/my//object//name.

To retrieve the Cloud Storage URI:

  1. Open the Cloud Storage console.

  2. Browse to the location of the object (file) that contains the source data.

  3. At the top of the Cloud Storage console, note the path to the object. To compose the URI, replace gs://bucket/file with the appropriate path, for example, gs://mybucket/myfile.json. bucket is the Cloud Storage bucket name and file is the name of the object (file) containing the data.

Wildcard support for Cloud Storage URIs

If your Cloud Storage data is separated into multiple files that share a common base-name, you can use a wildcard in the URI when you load the data.

To add a wildcard to the Cloud Storage URI, you append an asterisk (*) to the base-name. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the wildcard URI is gs://mybucket/fed-sample*. You can then use this wildcard URI in the Cloud Console, the bq command-line tool, the API, or the client libraries.

You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.

For Google Datastore exports, only one URI can be specified, and it must end with .backup_info or .export_metadata.

The asterisk wildcard character isn't allowed when you do the following:

  • Create external tables linked to Datastore or Firestore exports.
  • Load Datastore or Firestore export data from Cloud Storage.
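
For formats that do support wildcards, you use the wildcard URI exactly like a single-object URI. The following minimal Python sketch loads every object that matches gs://mybucket/fed-sample* into one table; the table ID is a placeholder, and a CSV source with schema auto-detection is assumed.

    from google.cloud import bigquery

    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    # table_id = "your-project.your_dataset.your_table_name"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
    )

    # One wildcard matches fed-sample000001.csv, fed-sample000002.csv, and so on.
    uri = "gs://mybucket/fed-sample*"

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # Waits for the job to complete.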

Limitations

You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:

  • If your dataset's location is set to a value other than US, the regional or multi-regional Cloud Storage bucket must be in the same region as the dataset.
  • BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.

Depending on the format of your Cloud Storage source data, there may be additional limitations. For more information, see:

Loading data from local files

You can load data from a readable data source (such as your local machine) by using one of the following:

  • The Google Cloud Console
  • The bq command-line tool's bq load command
  • The API
  • The client libraries

When you load data using the Cloud Console or the bq command-line tool, a load job is automatically created.

To load data from a local data source:

Console

  1. Open the BigQuery page in the Cloud Console.

  2. In the navigation panel, in the Resources section, expand your Google Cloud project and select a dataset.

  3. On the right side of the window, in the details panel, click Create table. The process for loading data is the same as the process for creating an empty table.

  4. On the Create table page, in the Source section:

    • For Create table from, select Upload.

    • Below Select file, click Browse.

    • Browse to the file, and click Open. Note that wildcards and comma-separated lists are not supported for local files.

    • For File format, select CSV, JSON (newline delimited), Avro, Parquet, or ORC.

  5. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset.

    • In the Table name field, enter the name of the table you're creating in BigQuery.

    • Verify that Table type is set to Native table.

  6. In the Schema section, enter the schema definition.

    • For CSV and JSON files, you can check the Auto-detect option to enable schema auto-detection. For other supported file types, the schema is self-described in the source data.

    • You can also enter schema information manually by:

      • Clicking Edit as text and entering the table schema as a JSON array.

      • Using Add Field to manually input the schema.

  7. Select applicable items in the Advanced options section. For information about the available options, see CSV options and JSON options.

  8. Optional: In the Advanced options section, choose the write disposition:

    • Write if empty: Write the data only if the table is empty.
    • Append to table: Append the data to the end of the table. This setting is the default.
    • Overwrite table: Erase all existing data in the table before writing the new data.
  9. Click Create table.

bq

Use the bq load command, specify the source_format, and include the path to the local file.

(Optional) Supply the --location flag and set the value to your location.

If you are loading data in a project other than your default project, add the project ID to the dataset in the following format: PROJECT_ID:DATASET.

    bq --location=LOCATION load \
    --source_format=FORMAT \
    PROJECT_ID:DATASET.TABLE \
    PATH_TO_SOURCE \
    SCHEMA

Replace the following:

  • LOCATION: your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location by using the .bigqueryrc file.
  • FORMAT: CSV, AVRO, PARQUET, ORC, or NEWLINE_DELIMITED_JSON.
  • PROJECT_ID: your project ID.
  • DATASET: an existing dataset.
  • TABLE: the name of the table into which you're loading data.
  • PATH_TO_SOURCE: the path to the local file.
  • SCHEMA: a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.

In addition, you can add flags for options that let you control how BigQuery parses your data. For example, you can use the --skip_leading_rows flag to ignore header rows in a CSV file. For more information, see CSV options and JSON options.

Examples:

The following command loads a local newline-delimited JSON file (mydata.json) into a table named mytable in mydataset in your default project. The schema is defined in a local schema file named myschema.json.

    bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    ./mydata.json \
    ./myschema.json

The following command loads a local CSV file (mydata.csv) into a table named mytable in mydataset in myotherproject. The schema is defined inline in the format FIELD:DATA_TYPE, FIELD:DATA_TYPE.

    bq load \
    --source_format=CSV \
    myotherproject:mydataset.mytable \
    ./mydata.csv \
    qtr:STRING,sales:FLOAT,year:STRING

The following command loads a local CSV file (mydata.csv) into a table named mytable in mydataset in your default project. The schema is defined using schema auto-detection.

    bq load \
    --autodetect \
    --source_format=CSV \
    mydataset.mytable \
    ./mydata.csv

C#

Before trying this sample, follow the C# setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery C# API reference documentation.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, use the upload options class for the appropriate format from the JobCreationOptions base class instead of UploadCsvOptions.


using Google.Cloud.BigQuery.V2;
using System;
using System.IO;

public class BigQueryLoadFromFile
{
    public void LoadFromFile(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id",
        string tableId = "your_table_id",
        string filePath = "path/to/file.csv"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        // Create job configuration
        var uploadCsvOptions = new UploadCsvOptions()
        {
            SkipLeadingRows = 1,  // Skips the file headers
            Autodetect = true
        };
        using (FileStream stream = File.Open(filePath, FileMode.Open))
        {
            // Create and run job
            // Note that there are methods available for formats other than CSV
            BigQueryJob job = client.UploadCsv(
                datasetId, tableId, null, stream, uploadCsvOptions);
            job.PollUntilCompleted();  // Waits for the job to complete.
            // Display the number of rows uploaded
            BigQueryTable table = client.GetTable(datasetId, tableId);
            Console.WriteLine(
                $"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
        }
    }
}

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the DataFormat property of the NewReaderSource to the appropriate format.

import (
	"context"
	"fmt"
	"os"

	"cloud.google.com/go/bigquery"
)

// importCSVFromFile demonstrates loading data into a BigQuery table using a file on the local filesystem.
func importCSVFromFile(projectID, datasetID, tableID, filename string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	f, err := os.Open(filename)
	if err != nil {
		return err
	}
	source := bigquery.NewReaderSource(f)
	source.AutoDetect = true   // Allow BigQuery to determine schema.
	source.SkipLeadingRows = 1 // CSV has a single header line.

	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(source)

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	if err := status.Err(); err != nil {
		return err
	}
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Java API reference documentation.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the FormatOptions to the appropriate format.

TableId tableId = TableId.of(datasetName, tableName);
WriteChannelConfiguration writeChannelConfiguration =
    WriteChannelConfiguration.newBuilder(tableId).setFormatOptions(FormatOptions.csv()).build();
// The location must be specified; other fields can be auto-detected.
JobId jobId = JobId.newBuilder().setLocation(location).build();
TableDataWriteChannel writer = bigquery.writer(jobId, writeChannelConfiguration);
// Write data to writer
try (OutputStream stream = Channels.newOutputStream(writer)) {
  Files.copy(csvPath, stream);
}
// Get load job
Job job = writer.getJob();
job = job.waitFor();
LoadStatistics stats = job.getStatistics();
return stats.getOutputRows();

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the metadata parameter of the load function to the appropriate format.

// Imports the Google Cloud client library
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function loadLocalFile() {
  // Imports a local file into a table.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const filename = '/path/to/file.csv';
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Load data from a local file into the table
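  // To load a file of another format, pass a metadata object as a second
  // argument to load(), for example: {sourceFormat: 'NEWLINE_DELIMITED_JSON'}.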
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(filename);

  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the sourceFormat to the appropriate format.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId  = 'The Google project ID';
// $datasetId  = 'The BigQuery dataset ID';
// $tableId    = 'The BigQuery table ID';
// $source     = 'The path to the CSV source file to import';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table($tableId);
// create the import job
$loadConfig = $table->load(fopen($source, 'r'))->sourceFormat('CSV');

$job = $table->runJob($loadConfig);
// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    printf('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});
// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the LoadJobConfig.source_format property to the appropriate format.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1, autodetect=True,
)

with open(file_path, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)

job.result()  # Waits for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Ruby API reference documentation.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the format parameter of the Table#load_job method to the appropriate format.

require "google/cloud/bigquery"

def load_from_file dataset_id = "your_dataset_id",
                   file_path  = "path/to/file.csv"

  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  table_id = "new_table_id"

  # Infer the config.location based on the location of the referenced dataset.
  load_job = dataset.load_job table_id, file_path do |config|
    config.skip_leading = 1
    config.autodetect   = true
  end
  load_job.wait_until_done! # Waits for table load to complete.

  table = dataset.table table_id
  puts "Loaded #{table.rows_count} rows into #{table.id}"
end

Limitations

Loading data from a local data source is subject to the following limitations:

  • Wildcards and comma-separated lists are not supported when you load files from a local data source. Files must be loaded individually.
  • When using the classic BigQuery web UI, files loaded from a local data source must be 10 MB or less and must contain fewer than 16,000 rows.

Loading compressed and uncompressed data

The Avro binary format is the preferred format for loading both compressed and uncompressed data. Avro data is faster to load because the data can be read in parallel, even when the data blocks are compressed. Compressed Avro files are not supported, but compressed data blocks are. BigQuery supports the DEFLATE and Snappy codecs for compressed data blocks in Avro files.

Parquet binary format is also a good choice because Parquet's efficient, per-column encoding typically results in a better compression ratio and smaller files. Parquet files also leverage compression techniques that allow files to be loaded in parallel. Compressed Parquet files are not supported, but compressed data blocks are. BigQuery supports Snappy, GZip, and LZO_1X codecs for compressed data blocks in Parquet files.

The ORC binary format offers benefits similar to the benefits of the Parquet format. Data in ORC files is fast to load because data stripes can be read in parallel. The rows in each data stripe are loaded sequentially. To optimize load time, use a data stripe size of approximately 256 MB or less. Compressed ORC files are not supported, but compressed file footer and stripes are. BigQuery supports Zlib, Snappy, LZO, and LZ4 compression for ORC file footers and stripes.

For other data formats such as CSV and JSON, BigQuery can load uncompressed files significantly faster than compressed files because uncompressed files can be read in parallel. Because uncompressed files are larger, using them can lead to bandwidth limitations and higher Cloud Storage costs for data staged in Cloud Storage prior to being loaded into BigQuery. Keep in mind that line ordering isn't guaranteed for compressed or uncompressed files. It's important to weigh these tradeoffs depending on your use case.

In general, if bandwidth is limited, compress your CSV and JSON files by using gzip before uploading them to Cloud Storage. Currently, when you load data into BigQuery, gzip is the only supported file compression type for CSV and JSON files. If loading speed is important to your app and you have a lot of bandwidth to load your data, leave your files uncompressed.
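
For example, the following minimal Python sketch loads a gzip-compressed CSV file that has been staged in Cloud Storage; the table ID and object name are placeholders. The job configuration is the same as for an uncompressed CSV source.

    from google.cloud import bigquery

    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    # table_id = "your-project.your_dataset.your_table_name"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )

    # A gzip-compressed CSV file staged in Cloud Storage (placeholder object name).
    uri = "gs://mybucket/mydata.csv.gz"

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # Waits for the job to complete.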

Appending to or overwriting a table

You can load additional data into a table either from source files or by appending query results. If the schema of the data does not match the schema of the destination table or partition, you can update the schema when you append to it or overwrite it.

If you update the schema when appending data, BigQuery allows you to:

  • Add new fields
  • Relax REQUIRED fields to NULLABLE

If you are overwriting a table, the schema is always overwritten. Schema updates are not restricted when you overwrite a table.
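
As a sketch of how these schema updates are expressed in the Python client library, the following load job appends newline-delimited JSON to an existing table while allowing field additions and REQUIRED-to-NULLABLE relaxation; the table ID and file name are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of an existing table.
    # table_id = "your-project.your_dataset.your_table_name"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        schema_update_options=[
            bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,    # Add new fields.
            bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,  # REQUIRED to NULLABLE.
        ],
    )

    with open("mydata.json", "rb") as source_file:  # Placeholder local file.
        job = client.load_table_from_file(source_file, table_id, job_config=job_config)
    job.result()  # Waits for the job to complete.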

In the Cloud Console, use the Write preference option to specify what action to take when you load data from a source file or from a query result. The bq command-line tool and the API include the following options:

  • Write if empty (no bq tool flag; BigQuery API property WRITE_EMPTY): Writes the data only if the table is empty.
  • Append to table (bq tool flag --noreplace or --replace=false; if --replace is unspecified, the default is append; BigQuery API property WRITE_APPEND): Appends the data to the end of the table. This is the default.
  • Overwrite table (bq tool flag --replace or --replace=true; BigQuery API property WRITE_TRUNCATE): Erases all existing data in a table before writing the new data.
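
In the client libraries, the same choice is made through the job configuration's write disposition. The following minimal Python sketch overwrites the destination table; substitute WRITE_EMPTY or WRITE_APPEND for the other behaviors. The table ID and source URI are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the destination table.
    # table_id = "your-project.your_dataset.your_table_name"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        # One of WRITE_EMPTY, WRITE_APPEND (the default), or WRITE_TRUNCATE.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    uri = "gs://mybucket/mydata.csv"  # Placeholder Cloud Storage URI.

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # Waits for the job to complete.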

Quota policy

For information about the quota policy for batch loading data, see Load jobs on the Quotas and limits page.

Pricing

Currently, there is no charge for batch loading data into BigQuery. For more information, see the pricing page.