Loading CSV Data from Google Cloud Storage

When you load CSV data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

For information about loading CSV data from a local file, see Loading data into BigQuery from a local data source.

Limitations

When you load CSV data from Cloud Storage into BigQuery, note the following:

  • CSV files do not support nested or repeated data.
  • If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed CSV data into BigQuery is slower than loading uncompressed data.

CSV encoding

BigQuery expects CSV data to be UTF-8 encoded. If you have CSV files with data encoded in ISO-8859-1 (also known as Latin-1) format, you should explicitly specify the encoding when you load your data so it can be converted to UTF-8.

Delimiters in CSV files can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. BigQuery converts the string to ISO-8859-1 encoding and uses the first byte of the encoded string to split the data in its raw, binary state.
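
The first-byte rule can be illustrated directly in Python. The section sign (§, U+00A7) falls in the 128-255 range, so it must be sent to BigQuery UTF-8 encoded, but the byte the data is actually split on comes from the character's ISO-8859-1 encoding:

```python
# A delimiter in the 128-255 range: the section sign (U+00A7).
delimiter = "\u00a7"

# What you send in the request: the UTF-8 encoding (two bytes).
utf8_bytes = delimiter.encode("utf-8")          # b'\xc2\xa7'

# What BigQuery splits on: the first byte of the ISO-8859-1 encoding.
split_byte = delimiter.encode("iso-8859-1")[0]  # 0xA7

print(utf8_bytes, hex(split_byte))
```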

Required permissions

When you load data into BigQuery, you need project or dataset-level permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need access to the bucket that contains your data.

BigQuery permissions

When you are loading data into BigQuery from Cloud Storage, you must be granted the bigquery.dataOwner or bigquery.dataEditor role at the project level or at the dataset level. Both roles grant users and groups permission to load data into a new table or to append to or overwrite an existing table.

Granting the roles at the project level gives the user or group permission to load data into tables in every dataset in the project. Granting the roles at the dataset level gives the user or group the ability to load data only into tables in that dataset.

For more information on configuring dataset access, see Assigning access controls to datasets. For more information on IAM roles in BigQuery, see Access Control.

Cloud Storage permissions

To load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions at the project level or on that individual bucket. If you use a URI wildcard, you must also have storage.objects.list permissions.

The predefined IAM role storage.objectViewer can be granted to provide storage.objects.get and storage.objects.list permissions.

Loading CSV data into a new table

To load CSV data from Google Cloud Storage into a new BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported.
    • For File format, select Comma-separated values (CSV).
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  5. In the Schema section, enter the schema definition.

    • For CSV files, you can check the Auto-detect option to enable schema auto-detection.

    • You can also enter schema information manually by:

      • Clicking Edit as text and entering the table schema as a JSON array:

      • Using Add Field to manually input the schema:

  6. Select applicable items in the Options section and then click Create Table. For information on the available options, see CSV options.

Command-line

Use the bq load command, specify CSV as the source_format, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard:

bq load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]

where:

  • [FORMAT] is CSV.
  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
  • [SCHEMA] is a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.

In addition, you can add flags for CSV options that allow you to control how BigQuery parses your data. For example, you can use the --skip_leading_rows flag to ignore header rows in the CSV file, and you can use the --encoding flag to identify the character encoding of the data.

Examples:

  • The following command loads data from gs://mybucket/mydata.csv into a table named mytable in mydataset. The schema is defined in a local schema file named myschema.json:

    bq load --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
    
  • The following command loads data from gs://mybucket/mydata.csv into a table named mytable in mydataset. The schema is defined inline in the format [FIELD]:[DATA_TYPE], [FIELD]:[DATA_TYPE]:

    bq load --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv qtr:STRING,sales:FLOAT,year:STRING
    

    When you specify the schema on the command line, you cannot include a RECORD (STRUCT) type, you cannot include a field description, and you cannot specify the field mode. All field modes default to NULLABLE. To include field descriptions, modes, and RECORD types, supply a JSON schema file instead.

  • The following command loads data from gs://mybucket/mydata.csv into a table named mytable in mydataset. The schema is defined using schema auto-detection:

    bq load --autodetect --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv
    
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard, and the schema is defined using schema auto-detection:

    bq load --autodetect --source_format=CSV mydataset.mytable gs://mybucket/mydata*.csv
    
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs, and the schema is defined using schema auto-detection:

    bq load --autodetect --source_format=CSV mydataset.mytable "gs://mybucket/myfile.csv,gs://mybucket/myfile2.csv"
    
  • The following command loads data from gs://mybucket/mydata.csv into a table named mytable in mydataset. The schema is defined in a local schema file named myschema.json, and the --skip_leading_rows flag is used to ignore the first two header rows in the CSV file:

    bq load --skip_leading_rows=2 --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
    

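The restriction noted above is why examples that need field descriptions, modes, or RECORD types pass a JSON schema file. A hypothetical myschema.json for the qtr/sales/year schema used in the inline example might look like this (the descriptions are invented for illustration):

```json
[
  {"name": "qtr",   "type": "STRING", "mode": "REQUIRED", "description": "Fiscal quarter"},
  {"name": "sales", "type": "FLOAT",  "mode": "NULLABLE", "description": "Total sales"},
  {"name": "year",  "type": "STRING", "mode": "NULLABLE", "description": "Fiscal year"}
]
```
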
API

Set the following properties to load CSV data using the API.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. Each URI can contain one '*' wildcard character.

  3. Specify the CSV data format by setting the configuration.load.sourceFormat property to CSV.

  4. To check the job status, call jobs.get([JOB_ID]), where [JOB_ID] is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is added.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.
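
The status checks in step 4 can be sketched as plain Python over the JSON job resource that jobs.get returns (the sample resources below are illustrative, not real API responses):

```python
def summarize_job(job):
    """Interpret the status block of a jobs.get response (a dict)."""
    status = job["status"]
    if status["state"] != "DONE":
        return "running"
    if "errorResult" in status:
        # Fatal error: no table is created and no data is added.
        return "failed: " + status["errorResult"]["message"]
    # Success, possibly with non-fatal, row-level errors in status.errors.
    return "succeeded ({} non-fatal errors)".format(len(status.get("errors", [])))

print(summarize_job({"status": {"state": "DONE"}}))
# succeeded (0 non-fatal errors)
```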

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert() to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
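
A minimal sketch of that pattern, where insert_job stands in for a real jobs.insert() call:

```python
import uuid

def load_with_retry(insert_job, max_attempts=3):
    """Generate the job ID client-side, then retry jobs.insert() with the
    same ID on network failure; because insertion is idempotent, at most
    one of the attempts actually creates the job."""
    job_id = "load_{}".format(uuid.uuid4().hex)
    for _ in range(max_attempts):
        try:
            insert_job(job_id)  # stand-in for jobs.insert()
            return job_id
        except ConnectionError:
            continue  # safe to retry: same job_id every time
    raise RuntimeError("jobs.insert() failed after {} attempts".format(max_attempts))
```

Because the client chose the ID, it can also recover by polling jobs.get on job_id after a timeout instead of retrying blindly.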

C#

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

public void ImportDataFromCloudStorage(string projectId, string datasetId,
    string tableId, BigQueryClient client, string fileName, string folder = null)
{
    StorageClient gcsClient = StorageClient.Create();

    using (var stream = new MemoryStream())
    {
        // Set Cloud Storage Bucket name. This uses a bucket named the same as the project.
        string bucket = projectId;
        // If folder is passed in, add it to Cloud Storage File Path using "/" character
        string filePath = string.IsNullOrEmpty(folder) ? fileName : folder + "/" + fileName;
        // Download Google Cloud Storage object into stream
        gcsClient.DownloadObject(projectId, filePath, stream);

        // This example uploads data to an existing table. If the upload will create a new table
        // or if the schema in the JSON isn't identical to the schema in the table,
        // create a schema to pass into the call instead of passing in a null value.
        BigQueryJob job = client.UploadJson(datasetId, tableId, null, stream);
        // Use the job to find out when the data has finished being inserted into the table,
        // report errors etc.

        // Wait for the job to complete.
        job.PollUntilCompleted();
    }
}

Go

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

// For example, "gs://data-bucket/path/to/data.csv"
gcsRef := bigquery.NewGCSReference(gcsURI)
gcsRef.AllowJaggedRows = true
// TODO: set other options on the GCSReference.

loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
loader.CreateDisposition = bigquery.CreateNever
// TODO: set other options on the Loader.

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}
if err := status.Err(); err != nil {
	return err
}

Java

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

Job job = table.load(FormatOptions.csv(), sourceUri);
// Wait for the job to complete
try {
  Job completedJob = job.waitFor(RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
            RetryOption.totalTimeout(Duration.ofMinutes(3)));
  if (completedJob != null && completedJob.getStatus().getError() == null) {
    // Job completed successfully
  } else {
    // Handle error case
  }
} catch (InterruptedException e) {
  // Handle interrupted wait
}

Node.js

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

// Imports the Google Cloud client libraries
const BigQuery = require('@google-cloud/bigquery');
const Storage = require('@google-cloud/storage');

// The project ID to use, e.g. "your-project-id"
// const projectId = "your-project-id";

// The ID of the dataset of the table into which data should be imported, e.g. "my_dataset"
// const datasetId = "my_dataset";

// The ID of the table into which data should be imported, e.g. "my_table"
// const tableId = "my_table";

// The name of the Google Cloud Storage bucket where the file is located, e.g. "my-bucket"
// const bucketName = "my-bucket";

// The name of the file from which data should be imported, e.g. "file.csv"
// const filename = "file.csv";

// Instantiates clients
const bigquery = BigQuery({
  projectId: projectId
});

const storage = Storage({
  projectId: projectId
});

let job;

// Imports data from a Google Cloud Storage file into the table
bigquery
  .dataset(datasetId)
  .table(tableId)
  .import(storage.bucket(bucketName).file(filename))
  .then((results) => {
    job = results[0];
    console.log(`Job ${job.id} started.`);

    // Wait for the job to finish
    return job.promise();
  })
  .then((results) => {
    // Get the job's status
    return job.getMetadata();
  }).then((metadata) => {
    // Check the job's status for errors
    const errors = metadata[0].status.errors;
    if (errors && errors.length > 0) {
      throw errors;
    }
  }).then(() => {
    console.log(`Job ${job.id} completed.`);
  })
  .catch((err) => {
    console.error('ERROR:', err);
  });

PHP

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Storage\StorageClient;
use Google\Cloud\Core\ExponentialBackoff;

/**
 * @param string $projectId  The Google project ID.
 * @param string $datasetId  The BigQuery dataset ID.
 * @param string $tableId    The BigQuery table ID.
 * @param string $bucketName The Cloud Storage bucket Name.
 * @param string $objectName The Cloud Storage object Name.
 */
function import_from_storage($projectId, $datasetId, $tableId, $bucketName, $objectName)
{
    // determine the import options from the object name
    $options = [];
    if ('.backup_info' === substr($objectName, -12)) {
        $options['jobConfig'] = ['sourceFormat' => 'DATASTORE_BACKUP'];
    } elseif ('.json' === substr($objectName, -5)) {
        $options['jobConfig'] = ['sourceFormat' => 'NEWLINE_DELIMITED_JSON'];
    }
    // instantiate the bigquery table service
    $bigQuery = new BigQueryClient([
        'projectId' => $projectId,
    ]);
    $dataset = $bigQuery->dataset($datasetId);
    $table = $dataset->table($tableId);
    // load the storage object
    $storage = new StorageClient([
        'projectId' => $projectId,
    ]);
    $object = $storage->bucket($bucketName)->object($objectName);
    // create the import job
    $job = $table->loadFromStorage($object, $options);
    // poll the job until it is complete
    $backoff = new ExponentialBackoff(10);
    $backoff->execute(function () use ($job) {
        print('Waiting for job to complete' . PHP_EOL);
        $job->reload();
        if (!$job->isComplete()) {
            throw new Exception('Job has not yet completed', 500);
        }
    });
    // check if the job has errors
    if (isset($job->info()['status']['errorResult'])) {
        $error = $job->info()['status']['errorResult']['message'];
        printf('Error running job: %s' . PHP_EOL, $error);
    } else {
        print('Data imported successfully' . PHP_EOL);
    }
}

Python

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

from google.cloud import bigquery

def load_data_from_gcs(dataset_id, table_id, source):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    job = bigquery_client.load_table_from_uri(source, table_ref)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))

Ruby

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

# project_id   = "Your Google Cloud project ID"
# dataset_id   = "ID of the dataset containing table"
# table_id     = "ID of the table to import data into"
# storage_path = "Storage path to file to import, eg. gs://bucket/file.csv"

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: project_id
dataset  = bigquery.dataset dataset_id
table    = dataset.table table_id

puts "Importing data from Cloud Storage file: #{storage_path}"
load_job = table.load storage_path

puts "Waiting for load job to complete: #{load_job.job_id}"
load_job.wait_until_done!

puts "Data imported"

Appending to or overwriting a table with CSV data

You can load additional data into a table either from source files or by appending query results. If the schema of the data does not match the schema of the destination table or partition, you can update the schema when you append to it or overwrite it.

If you update the schema when appending data, BigQuery allows you to:

  • Add new fields
  • Relax REQUIRED fields to NULLABLE

If you are overwriting a table, the schema is always overwritten. Schema updates are not restricted when you overwrite a table.

In the BigQuery web UI, you use the Write preference option to specify what action to take when you load data from a source file or from a query result. The CLI and API include the following options:

  • Write if empty. CLI flag: none. BigQuery API property: WRITE_EMPTY. Writes the data only if the table is empty.
  • Append to table. CLI flag: --noreplace or --replace=false; if --[no]replace is unspecified, the default is append. BigQuery API property: WRITE_APPEND (the default). Appends the data to the end of the table.
  • Overwrite table. CLI flag: --replace or --replace=true. BigQuery API property: WRITE_TRUNCATE. Erases all existing data in a table before writing the new data.
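
As a mental model only (not how BigQuery stores data), the three dispositions behave like this sketch:

```python
def write_rows(existing, new_rows, disposition):
    """Model BigQuery write dispositions on a plain list of rows."""
    if disposition == "WRITE_EMPTY":
        if existing:
            raise ValueError("destination table is not empty")
        return list(new_rows)
    if disposition == "WRITE_APPEND":  # the default
        return existing + list(new_rows)
    if disposition == "WRITE_TRUNCATE":
        return list(new_rows)  # existing data is erased first
    raise ValueError("unknown write disposition")

print(write_rows([1, 2], [3], "WRITE_APPEND"))    # [1, 2, 3]
print(write_rows([1, 2], [3], "WRITE_TRUNCATE"))  # [3]
```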

To load CSV data from Google Cloud Storage and to append to or overwrite a BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the UI, but wildcards are supported.
    • For File format, select Comma-separated values (CSV).
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're appending or overwriting.
    • Verify that Table type is set to Native table.
  5. In the Schema section, enter the schema definition.

    • For CSV files, you can check the Auto-detect option to enable schema auto-detection.

    • You can also enter schema information manually by:

      • Clicking Edit as text and entering the table schema as a JSON array:

      • Using Add Field to manually input the schema:

  6. In the Options section, for Write preference, choose Write if empty, Append to table, or Overwrite table.

  7. Click Create Table.

Command-line

Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data.

When appending or overwriting a table, you can use the --schema_update_option flag to update the schema of the destination table with the schema of the new data. The following options can be used with the --schema_update_option flag:

  • ALLOW_FIELD_ADDITION: Adds new fields to the schema; new fields cannot be REQUIRED
  • ALLOW_FIELD_RELAXATION: Relaxes required fields to nullable; repeat this option to specify a list of values
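
The two options can be sketched as operations on schemas modeled as lists of field dicts (the field names and this merge logic are illustrative, not BigQuery's implementation):

```python
def apply_schema_update(old, new, options):
    """Model ALLOW_FIELD_ADDITION and ALLOW_FIELD_RELAXATION for schemas
    given as lists of {"name", "type", "mode"} dicts."""
    merged = {f["name"]: dict(f) for f in old}
    for field in new:
        name = field["name"]
        if name not in merged:
            if "ALLOW_FIELD_ADDITION" not in options:
                raise ValueError("cannot add field " + name)
            if field.get("mode") == "REQUIRED":
                raise ValueError("new fields cannot be REQUIRED")
            merged[name] = dict(field)
        elif merged[name]["mode"] == "REQUIRED" and field.get("mode") == "NULLABLE":
            if "ALLOW_FIELD_RELAXATION" not in options:
                raise ValueError("cannot relax field " + name)
            merged[name]["mode"] = "NULLABLE"
    return list(merged.values())
```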

bq load --[no]replace [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]

where:

  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
  • [SCHEMA] is a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.

In addition, you can add flags for CSV options that allow you to control how BigQuery parses your CSV data. For example, you can use the --skip_leading_rows flag to ignore header rows in the CSV file, and you can use the --encoding flag to identify the character encoding of the data.

Examples:

  • The following command loads data from gs://mybucket/mydata.csv and overwrites a table named mytable in mydataset. The schema is defined using schema auto-detection:

    bq load --autodetect --replace --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv
    
  • The following command loads data from gs://mybucket/mydata.csv and appends data to a table named mytable in mydataset. The schema is defined in a local JSON schema file named myschema.json:

    bq load --noreplace --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
    
  • The following command loads data from gs://mybucket/mydata.csv and appends data to a table named mytable in mydataset. A local JSON schema file named myschema.json is used. The schema definition contains new fields not present in the destination table:

    bq load --noreplace --schema_update_option=ALLOW_FIELD_ADDITION --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
    
  • The following command loads data from gs://mybucket/mydata.csv and appends data to a table named mytable in mydataset. A local JSON schema file named myschema.json is used. The schema definition changes (relaxes) two REQUIRED fields to NULLABLE:

    bq load --noreplace --schema_update_option=ALLOW_FIELD_RELAXATION --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
    

API

Set the following properties to load CSV data using the API.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported when loading CSV data from Google Cloud Storage.

  3. Specify the data format by setting the configuration.load.sourceFormat property to CSV.

  4. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE, WRITE_APPEND, or WRITE_EMPTY.

  5. To update the schema in your load job, set the configuration.load.schemaUpdateOptions property to ALLOW_FIELD_ADDITION or ALLOW_FIELD_RELAXATION.

CSV options

To change how BigQuery parses CSV data, specify additional options in the UI, CLI, or API. For more information on the CSV format, see RFC 4180.

  • Field delimiter. Web UI option: Field delimiter (Comma, Tab, Pipe, Other). CLI flag: -F or --field_delimiter. API property: fieldDelimiter. (Optional) The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. BigQuery converts the string to ISO-8859-1 encoding and uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (",").

  • Header rows. Web UI option: Header rows to skip. CLI flag: --skip_leading_rows. API property: skipLeadingRows. (Optional) An integer indicating the number of header rows in the source data.

  • Number of bad records allowed. Web UI option: Number of errors allowed. CLI flag: --max_bad_records. API property: maxBadRecords. (Optional) The maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an invalid error is returned in the job result. The default value is 0, which requires that all records are valid.

  • Newline characters. Web UI option: Allow quoted newlines. CLI flag: --allow_quoted_newlines. API property: allowQuotedNewlines. (Optional) Indicates whether to allow quoted data sections that contain newline characters in a CSV file. The default value is false.

  • Custom null values. Web UI option: none. CLI flag: --null_marker. API property: nullMarker. (Optional) Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string. If you set this property to a custom value, BigQuery throws an error if an empty string is present for all data types except STRING and BYTE. For STRING and BYTE columns, BigQuery interprets the empty string as an empty value.

  • Trailing optional columns. Web UI option: Allow jagged rows. CLI flag: --allow_jagged_rows. API property: allowJaggedRows. (Optional) Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false. Only applicable to CSV; ignored for other formats.

  • Unknown values. Web UI option: Ignore unknown values. CLI flag: --ignore_unknown_values. API property: ignoreUnknownValues. (Optional) Indicates whether BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false. The sourceFormat property determines what BigQuery treats as an extra value: for CSV, trailing columns; for JSON, named values that don't match any column names.

  • Quote. Web UI option: none. CLI flag: --quote. API property: quote. (Optional) The value that is used to quote data sections in a CSV file. BigQuery converts the string to ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data in its raw, binary state. The default value is a double quote ("). If your data does not contain quoted sections, set the property value to an empty string. If your data contains quoted newline characters, you must also set the allowQuotedNewlines property to true.

  • Encoding. Web UI option: none. CLI flag: -E or --encoding. API property: encoding. (Optional) The character encoding of the data. The supported values are UTF-8 and ISO-8859-1. The default value is UTF-8. BigQuery decodes the data after the raw, binary data has been split using the values of the quote and fieldDelimiter properties.
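
To see why the Allow quoted newlines option exists, Python's csv module shows the same distinction: a newline inside a quoted field is part of the value, not a record boundary:

```python
import csv
import io

data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# A CSV-aware parser sees three records (header plus two rows);
# a naive split on newlines would see four.
rows = list(csv.reader(io.StringIO(data)))
print(len(rows))  # 3
```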
