Batch loading data

You can load data into BigQuery from Cloud Storage or from a local file as a batch operation. The source data can be in any of the following formats:

  • Avro
  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • ORC
  • Parquet
  • Firestore exports stored in Cloud Storage.

You can also use BigQuery Data Transfer Service to set up recurring loads from Cloud Storage into BigQuery.

Try it for yourself

If you're new to Google Cloud, create an account to evaluate how BigQuery performs in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Try BigQuery free

Before you begin

Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and create a dataset to store your data.

Required permissions

To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.

Permissions to load data into BigQuery

To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.tables.update
  • bigquery.jobs.create

Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin (includes the bigquery.jobs.create permission)
  • bigquery.user (includes the bigquery.jobs.create permission)
  • bigquery.jobUser (includes the bigquery.jobs.create permission)

Additionally, if you have the bigquery.datasets.create permission, you can create and update tables using a load job in the datasets that you create.

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Permissions to load data from Cloud Storage

To get the permissions that you need to load data from a Cloud Storage bucket, ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket. For more information about granting roles, see Manage access.

This predefined role contains the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to load data from a Cloud Storage bucket:

  • storage.buckets.get
  • storage.objects.get
  • storage.objects.list (required if you are using a URI wildcard)

You might also be able to get these permissions with custom roles or other predefined roles.

Create a dataset

Create a BigQuery dataset to store your data.

Loading data from Cloud Storage

BigQuery supports loading data from any of the following Cloud Storage storage classes:

  • Standard
  • Nearline
  • Coldline
  • Archive

To learn how to load data into BigQuery, see the page for your data format:

To learn how to configure a recurring load from Cloud Storage into BigQuery, see Cloud Storage transfers.

Location considerations

When you load data from Cloud Storage by using a BigLake or a non-BigLake external table, the data you load must be colocated with your BigQuery dataset.

  • You can load data from a Cloud Storage bucket located in any location if your BigQuery dataset is located in the US multi-region."

  • Multi-region bucket: If the Cloud Storage bucket that you want to load from is located in a multi-region bucket, then your BigQuery dataset can be in the same multi-region bucket or any single region that is included in the same multi-region bucket. For example, if the Cloud Storage bucket is in the EU region, then your BigQuery dataset can be in the EU multi-region or any single region in the EU.
  • Dual-region bucket: If the Cloud Storage bucket that you want to load from is located in a dual-region bucket, then your BigQuery dataset can be located in regions that are included in the dual-region bucket, or in a multi-region that includes the dual-region. For example, if your Cloud Storage bucket is located in the EUR4 region, then your BigQuery dataset can be located in either the Finland (europe-north1) single-region, the Netherlands (europe-west4) single-region, or the EU multi-region.

    For more information, see Create a dual-region bucket.

  • Single region bucket: If your Cloud Storage bucket that you want to load from is in a single-region, your BigQuery dataset can be in the same single-region, or in the multi-region that includes the single-region. For example, if you Cloud Storage bucket is in the Finland (europe-north1) region, your BigQuery dataset can be in the Finland or the EU multi-region.

  • One exception is that if your BigQuery dataset is located in the asia-northeast1 region, then your Cloud Storage bucket can be located in the EU multi-region.

For more information about Cloud Storage locations, see Bucket locations in the Cloud Storage documentation.

You cannot change the location of a dataset after it is created, but you can make a copy of the dataset or manually move it. For more information, see:

Retrieving the Cloud Storage URI

To load data from a Cloud Storage data source, you must provide the Cloud Storage URI.

The Cloud Storage resource path contains your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named myfile.csv, the resource path would be gs://mybucket/myfile.csv.

BigQuery does not support Cloud Storage resource paths that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutive slashes into a single slash. For example, the following resource path, though valid in Cloud Storage, does not work in BigQuery: gs://bucket/my//object//name.

To retrieve the Cloud Storage resource path:

  1. Open the Cloud Storage console.

    Cloud Storage console

  2. Browse to the location of the object (file) that contains the source data.

  3. Click on the name of the object.

    The Object details page opens.

  4. Copy the value provided in the gsutil URI field, which begins with gs://.

For Google Datastore exports, only one URI can be specified, and it must end with .backup_info or .export_metadata.

Wildcard support for Cloud Storage URIs

If your data is separated into multiple files, you can use an asterisk (*) wildcard to select multiple files. Use of the asterisk wildcard must follow these rules:

  • The asterisk can appear inside the object name or at the end of the object name.
  • Using multiple asterisks is unsupported. For example, the path gs://mybucket/fed-*/temp/*.csv is invalid.
  • Using an asterisk with the bucket name is unsupported.

Examples:

  • The following example shows how to select all of the files in all the folders which start with the prefix gs://mybucket/fed-samples/fed-sample:

    gs://mybucket/fed-samples/fed-sample*
    
  • The following example shows how to select only files with a .csv extension in the folder named fed-samples and any subfolders of fed-samples:

    gs://mybucket/fed-samples/*.csv
    
  • The following example shows how to select files with a naming pattern of fed-sample*.csv in the folder named fed-samples. This example doesn't select files in subfolders of fed-samples.

    gs://mybucket/fed-samples/fed-sample*.csv
    

When using the bq command-line tool, you might need to escape the asterisk on some platforms.

You can't use an asterisk wildcard when you load Datastore or Firestore export data from Cloud Storage.

Limitations

You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:

  • If your dataset's location is set to a value other than the US multi-region, then the Cloud Storage bucket must be in the same region or contained in the same multi-region as the dataset.
  • BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
  • BigQuery does not support Cloud Storage object versioning. If you include a generation number in the Cloud Storage URI, then the load job fails.

Depending on the format of your Cloud Storage source data, there may be additional limitations. For more information, see:

Loading data from local files

You can load data from a readable data source (such as your local machine) by using one of the following:

  • The Google Cloud console
  • The bq command-line tool's bq load command
  • The API
  • The client libraries

When you load data using the Google Cloud console or the bq command-line tool, a load job is automatically created.

To load data from a local data source:

Console

  1. Open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the Explorer panel, expand your project and select a dataset.

  3. Expand the Actions option and click Open.

  4. In the details panel, click Create table .

  5. On the Create table page, in the Source section:

    • For Create table from, select Upload.
    • For Select file, click Browse.
    • Browse to the file, and click Open. Note that wildcards and comma-separated lists are not supported for local files.
    • For File format, select CSV, JSON (newline delimited), Avro, Parquet, or ORC.
  6. On the Create table page, in the Destination section:

    • For Project, choose the appropriate project.
    • For Dataset, choose the appropriate dataset.
    • In the Table field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  7. In the Schema section, enter the schema definition.

    • For CSV and JSON files, you can check the Auto-detect option to enable schema auto-detect. Schema information is self-described in the source data for other supported file types.

    • You can also enter schema information manually by:

      • Clicking Edit as text and entering the table schema as a JSON array:

      • Using Add Field to manually input the schema.

  8. Select applicable items in the Advanced options section For information on the available options, see CSV options and JSON options.

  9. Optional: In the Advanced options choose the write disposition:

    • Write if empty: Write the data only if the table is empty.
    • Append to table: Append the data to the end of the table. This setting is the default.
    • Overwrite table: Erase all existing data in the table before writing the new data.
  10. Click Create Table.

bq

Use the bq load command, specify the source_format, and include the path to the local file.

(Optional) Supply the --location flag and set the value to your location.

If you are loading data in a project other than your default project, add the project ID to the dataset in the following format: PROJECT_ID:DATASET.

bq --location=LOCATION load \
--source_format=FORMAT \
PROJECT_ID:DATASET.TABLE \
PATH_TO_SOURCE \
SCHEMA

Replace the following:

  • LOCATION: your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location by using the .bigqueryrc file.
  • FORMAT: CSV, AVRO, PARQUET, ORC, or NEWLINE_DELIMITED_JSON.
  • project_id: your project ID.
  • dataset: an existing dataset.
  • table: the name of the table into which you're loading data.
  • path_to_source: the path to the local file.
  • schema: a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.

In addition, you can add flags for options that let you control how BigQuery parses your data. For example, you can use the --skip_leading_rows flag to ignore header rows in a CSV file. For more information, see CSV options and JSON options.

Examples:

The following command loads a local newline-delimited JSON file (mydata.json) into a table named mytable in mydataset in your default project. The schema is defined in a local schema file named myschema.json.

    bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    ./mydata.json \
    ./myschema.json

The following command loads a local CSV file (mydata.csv) into a table named mytable in mydataset in myotherproject. The schema is defined inline in the format FIELD:DATA_TYPE, FIELD:DATA_TYPE.

    bq load \
    --source_format=CSV \
    myotherproject:mydataset.mytable \
    ./mydata.csv \
    qtr:STRING,sales:FLOAT,year:STRING

The following command loads a local CSV file (mydata.csv) into a table named mytable in mydataset in your default project. The schema is defined using schema auto-detection.

    bq load \
    --autodetect \
    --source_format=CSV \
    mydataset.mytable \
    ./mydata.csv

C#

Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery C# API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, use the update options class for the appropriate format from the JobCreationOptions base class instead of UploadCsvOptions.


using Google.Cloud.BigQuery.V2;
using System;
using System.IO;

public class BigQueryLoadFromFile
{
    public void LoadFromFile(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id",
        string tableId = "your_table_id",
        string filePath = "path/to/file.csv"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        // Create job configuration
        var uploadCsvOptions = new UploadCsvOptions()
        {
            SkipLeadingRows = 1,  // Skips the file headers
            Autodetect = true
        };
        using (FileStream stream = File.Open(filePath, FileMode.Open))
        {
            // Create and run job
            // Note that there are methods available for formats other than CSV
            BigQueryJob job = client.UploadCsv(
                datasetId, tableId, null, stream, uploadCsvOptions);
            job = job.PollUntilCompleted().ThrowOnAnyError();  // Waits for the job to complete.

            // Display the number of rows uploaded
            BigQueryTable table = client.GetTable(datasetId, tableId);
            Console.WriteLine(
                $"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
        }
    }
}

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the DataFormat property of the NewReaderSource to the appropriate format.

import (
	"context"
	"fmt"
	"os"

	"cloud.google.com/go/bigquery"
)

// importCSVFromFile demonstrates loading data into a BigQuery table using a file on the local filesystem.
func importCSVFromFile(projectID, datasetID, tableID, filename string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	f, err := os.Open(filename)
	if err != nil {
		return err
	}
	source := bigquery.NewReaderSource(f)
	source.AutoDetect = true   // Allow BigQuery to determine schema.
	source.SkipLeadingRows = 1 // CSV has a single header line.

	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(source)

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	if err := status.Err(); err != nil {
		return err
	}
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the FormatOptions to the appropriate format.

TableId tableId = TableId.of(datasetName, tableName);
WriteChannelConfiguration writeChannelConfiguration =
    WriteChannelConfiguration.newBuilder(tableId).setFormatOptions(FormatOptions.csv()).build();
// The location must be specified; other fields can be auto-detected.
JobId jobId = JobId.newBuilder().setLocation(location).build();
TableDataWriteChannel writer = bigquery.writer(jobId, writeChannelConfiguration);
// Write data to writer
try (OutputStream stream = Channels.newOutputStream(writer)) {
  Files.copy(csvPath, stream);
}
// Get load job
Job job = writer.getJob();
job = job.waitFor();
LoadStatistics stats = job.getStatistics();
return stats.getOutputRows();

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the metadata parameter of the load function to the appropriate format.

// Imports the Google Cloud client library
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function loadLocalFile() {
  // Imports a local file into a table.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const filename = '/path/to/file.csv';
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Load data from a local file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(filename);

  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the sourceFormat to the appropriate format.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId  = 'The Google project ID';
// $datasetId  = 'The BigQuery dataset ID';
// $tableId    = 'The BigQuery table ID';
// $source     = 'The path to the CSV source file to import';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table($tableId);
// create the import job
$loadConfig = $table->load(fopen($source, 'r'))->sourceFormat('CSV');

$job = $table->runJob($loadConfig);
// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    printf('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});
// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the LoadJobConfig.source_format property to the appropriate format.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1, autodetect=True,
)

with open(file_path, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)

job.result()  # Waits for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Ruby API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

The following code demonstrates how to load a local CSV file to a new BigQuery table. To load a local file of another format, set the format parameter of the Table#load_job method to the appropriate format.

require "google/cloud/bigquery"

def load_from_file dataset_id = "your_dataset_id",
                   file_path  = "path/to/file.csv"

  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  table_id = "new_table_id"

  # Infer the config.location based on the location of the referenced dataset.
  load_job = dataset.load_job table_id, file_path do |config|
    config.skip_leading = 1
    config.autodetect   = true
  end
  load_job.wait_until_done! # Waits for table load to complete.

  table = dataset.table table_id
  puts "Loaded #{table.rows_count} rows into #{table.id}"
end

Limitations

Loading data from a local data source is subject to the following limitations:

  • Wildcards and comma-separated lists are not supported when you load files from a local data source. Files must be loaded individually.
  • When using the Google Cloud console, files loaded from a local data source cannot exceed 100 MB. For larger files, load the file from Cloud Storage.

Loading compressed and uncompressed data

For Avro, Parquet, and ORC formats, BigQuery supports loading files where the file data has been compressed using a supported codec. However, BigQuery doesn't support loading files in these formats that have themselves been compressed, for example by using the gzip utility.

The Avro binary format is the preferred format for loading both compressed and uncompressed data. Avro data is faster to load because the data can be read in parallel, even when the data blocks are compressed. For a list of supported compression codecs, see Avro compression.

Parquet binary format is also a good choice because Parquet's efficient, per-column encoding typically results in a better compression ratio and smaller files. Parquet files also leverage compression techniques that allow files to be loaded in parallel. For a list of supported compression codecs, see Parquet compression.

The ORC binary format offers benefits similar to the benefits of the Parquet format. Data in ORC files is fast to load because data stripes can be read in parallel. The rows in each data stripe are loaded sequentially. To optimize load time, use a data stripe size of approximately 256 MB or less. For a list of supported compression codecs, see ORC compression.

For other data formats such as CSV and JSON, BigQuery can load uncompressed files significantly faster than compressed files because uncompressed files can be read in parallel. Because uncompressed files are larger, using them can lead to bandwidth limitations and higher Cloud Storage costs for data staged in Cloud Storage prior to being loaded into BigQuery. Keep in mind that line ordering isn't guaranteed for compressed or uncompressed files. It's important to weigh these tradeoffs depending on your use case.

In general, if bandwidth is limited, compress your CSV and JSON files by using gzip before uploading them to Cloud Storage. gzip is the only supported file compression type for CSV and JSON files when loading data into BigQuery. If loading speed is important to your app and you have a lot of bandwidth to load your data, leave your files uncompressed.

Appending to or overwriting a table

You can load additional data into a table either from source files or by appending query results. If the schema of the data does not match the schema of the destination table or partition, you can update the schema when you append to it or overwrite it.

If you update the schema when appending data, BigQuery allows you to:

  • Add new fields
  • Relax REQUIRED fields to NULLABLE

If you are overwriting a table, the schema is always overwritten. Schema updates are not restricted when you overwrite a table.

In the Google Cloud console, use the Write preference option to specify what action to take when you load data from a source file or from a query result. The bq command-line tool and the API include the following options:

Console option bq tool flag BigQuery API property Description
Write if empty None WRITE_EMPTY Writes the data only if the table is empty.
Append to table --noreplace or --replace=false; if --replace is unspecified, the default is append WRITE_APPEND (Default) Appends the data to the end of the table.
Overwrite table --replace or --replace=true WRITE_TRUNCATE Erases all existing data in a table before writing the new data.

Quota policy

For information about the quota policy for batch loading data, see Load jobs on the Quotas and limits page.

View current quota usage

You can view your current usage of query, load, extract, or copy jobs by running an INFORMATION_SCHEMA query to view metadata about the jobs ran over a specified time period. You can compare your current usage against the quota limit to determine your quota usage for a particular type of job. The following example query uses the INFORMATION_SCHEMA.JOBS view to list the number of query, load, extract, and copy jobs by project:

SELECT
  sum(case  when job_type="QUERY" then 1 else 0 end) as QRY_CNT,
  sum(case  when job_type="LOAD" then 1 else 0 end) as LOAD_CNT,
  sum(case  when job_type="EXTRACT" then 1 else 0 end) as EXT_CNT,
  sum(case  when job_type="COPY" then 1 else 0 end) as CPY_CNT
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE date(creation_time)= CURRENT_DATE()

Pricing

There is no charge for batch loading data into BigQuery. For more information, see BigQuery data ingestion pricing.

Example use case

Suppose there is a nightly batch processing pipeline that needs to be completed by a fixed deadline. Data needs to be available by this deadline for further processing by another batch process to generate reports to be sent to a regulator. This use case is common in regulated industries such as finance.

Batch loading of data with load jobs is the right approach for this use case because latency is not a concern provided the deadline can be met. Ensure your Cloud Storage buckets meet the location requirements for loading data into the BigQuery dataset.

The result of a BigQuery load job is atomic; either all records get inserted or none do. As a best practice, when inserting all data in a single load job, create a new table by using the WRITE_TRUNCATE disposition of the JobConfigurationLoad resource. This is important when retrying a failed load job, as the client might not be able to distinguish between jobs that have failed and the failure caused by for example in communicating the success state back to the client.

Assuming data to be ingested has been successfully copied to Cloud Storage already, retrying with exponential backoff is sufficient to address ingestion failures.

It's recommended that a nightly batch job doesn't hit the default quota of 1,500 loads per table per day even with retries. When loading data incrementally, the default quota is sufficient for running a load job every 5 minutes and have unconsumed quota for at least 1 retry per job on average.