Loading CSV data from Cloud Storage

When you load CSV data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.

For information about loading CSV data from a local file, see Loading data into BigQuery from a local data source.

Limitations

When you load CSV data from Cloud Storage into BigQuery, note the following:

  • CSV files do not support nested or repeated data.
  • If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed CSV data into BigQuery is slower than loading uncompressed data.
  • You cannot include both compressed and uncompressed files in the same load job.
  • When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
  • When you load JSON or CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon (:) separator.

CSV encoding

BigQuery expects CSV data to be UTF-8 encoded. If you have CSV files with data encoded in ISO-8859-1 (also known as Latin-1) format, you should explicitly specify the encoding when you load your data so it can be converted to UTF-8.

Delimiters in CSV files can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. BigQuery converts the string to ISO-8859-1 encoding and uses the first byte of the encoded string to split the data in its raw, binary state.
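
For example, here is a minimal Python sketch (using the google-cloud-bigquery client; the bucket, dataset, and table names are placeholders) that declares the source encoding and a single-byte field delimiter on a load job:

# from google.cloud import bigquery
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True
# Declare the source encoding so BigQuery converts the data to UTF-8.
job_config.encoding = "ISO-8859-1"
# The delimiter is supplied as a UTF-8 string; BigQuery uses the first byte
# of its ISO-8859-1 encoding to split fields.
job_config.field_delimiter = ";"

# Placeholder URI and table reference; replace with your own.
uri = "gs://my-bucket/latin1-data.csv"
load_job = client.load_table_from_uri(
    uri, client.dataset("my_dataset").table("my_table"), job_config=job_config
)
load_job.result()  # Waits for the load to complete.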

Required permissions

When you load data into BigQuery, you need permissions to run a load job and permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need permissions to access the bucket that contains your data.

BigQuery permissions

At a minimum, the following permissions are required to load data into BigQuery. These permissions are required if you are loading data into a new table or partition, or if you are appending or overwriting a table or partition.

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.jobs.create

The following predefined Cloud IAM roles include both bigquery.tables.create and bigquery.tables.updateData permissions:

  • bigquery.dataEditor
  • bigquery.dataOwner
  • bigquery.admin

The following predefined Cloud IAM roles include bigquery.jobs.create permissions:

  • bigquery.user
  • bigquery.jobUser
  • bigquery.admin

In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create and update tables in the dataset via a load job.

For more information on Cloud IAM roles and permissions in BigQuery, see Access control.

Cloud Storage permissions

In order to load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions. If you are using a URI wildcard, you must also have storage.objects.list permissions.

The predefined Cloud IAM role storage.objectViewer can be granted to provide both storage.objects.get and storage.objects.list permissions.
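
As an illustration only, the following Python sketch (using a recent google-cloud-storage client; the bucket name and member are placeholders) grants roles/storage.objectViewer on a bucket, which includes both permissions:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name

# Add roles/storage.objectViewer, which includes storage.objects.get and
# storage.objects.list, to the bucket's IAM policy.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"user:loader@example.com"},  # placeholder member
    }
)
bucket.set_iam_policy(policy)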

Loading CSV data into a table

You can load CSV data from Cloud Storage into a new BigQuery table by:

  • Using the GCP Console or the classic web UI
  • Using the CLI's bq load command
  • Calling the jobs.insert API method and configuring a load job
  • Using the client libraries

To load CSV data from Cloud Storage into a new BigQuery table:

Console

  1. Open the BigQuery web UI in the GCP Console.

  2. In the navigation panel, in the Resources section, expand your project and select a dataset.

  3. On the right side of the window, in the details panel, click Create table. The process for loading data is the same as the process for creating an empty table.

  4. On the Create table page, in the Source section:

    • For Create table from, select Cloud Storage.

    • In the source field, browse to or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the GCP Console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.

    • For File format, select CSV.

  5. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset.

    • Verify that Table type is set to Native table.

    • In the Table name field, enter the name of the table you're creating in BigQuery.

  6. In the Schema section, for Auto detect, check Schema and input parameters to enable schema auto detection. Alternatively, you can manually enter the schema definition by:

    • Enabling Edit as text and entering the table schema as a JSON array.

    • Using Add field to manually input the schema.

  7. (Optional) To partition the table, choose your options in the Partition and cluster settings:

    • To create a partitioned table, click No partitioning, select Partition by field and choose a DATE or TIMESTAMP column. This option is unavailable if your schema does not include a DATE or TIMESTAMP column.
    • To create an ingestion-time partitioned table, click No partitioning and select Partition by ingestion time.
  8. (Optional) For Partitioning filter, click the Require partition filter box to require users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables. This option is unavailable if No partitioning is selected.

  9. (Optional) To cluster the table, in the Clustering order box, enter between one and four field names. Currently, clustering is supported only for partitioned tables.

  10. (Optional) Click Advanced options.

    • For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job results in an invalid error and fails.
    • For Unknown values, check Ignore unknown values to ignore any values in a row that are not present in the table's schema.
    • For Field delimiter, choose the character that separates the cells in your CSV file: Comma, Tab, Pipe, or Custom. If you choose Custom, enter the delimiter in the Custom field delimiter box. The default value is Comma.
    • For Header rows to skip, enter the number of header rows to skip at the top of the CSV file. The default value is 0.
    • For Quoted newlines, check Allow quoted newlines to allow quoted data sections that contain newline characters in a CSV file. The default value is false.
    • For Jagged rows, check Allow jagged rows to accept rows in CSV files that are missing trailing optional columns. The missing values are treated as nulls. If unchecked, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false.
    • For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
  11. Click Create table.

Classic UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • Click Create from source.
    • For Location, select Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.
    • For File format, select CSV.
  4. In the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  5. In the Schema section, for Auto detect, check Schema and input parameters to enable schema auto detection. Alternatively, you can manually enter the schema definition by:

    • Clicking Edit as text and entering the table schema as a JSON array.

    • Using Add Field to manually input the schema:

  6. (Optional) In the Options section:

    • For Field delimiter, choose the character that separates the cells in your CSV file: Comma, Tab, Pipe, or Other. If you choose Other, enter the delimiter in the Custom field delimiter box. The default value is Comma.
    • For Header rows to skip, enter the number of header rows to skip at the top of the CSV file. The default value is 0.
    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job results in an invalid error and fails.
    • For Allow quoted newlines, check the box to allow quoted data sections that contain newline characters in a CSV file. The default value is false.
    • For Allow jagged rows, check the box to accept rows in CSV files that are missing trailing optional columns. The missing values are treated as nulls. If unchecked, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false.
    • For Ignore unknown values, check the box to ignore any values in a row that are not present in the table's schema.
    • For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
    • To partition the table:
      • For Partitioning Type, click None and choose Day.
      • For Partitioning Field:
      • To create a partitioned table, choose a DATE or TIMESTAMP column. This option is unavailable if your schema does not include a DATE or TIMESTAMP column.
      • To create an ingestion-time partitioned table, leave the default value: _PARTITIONTIME.
      • Click the Require partition filter box to require users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables. This option is unavailable if Partitioning type is set to None.
    • To cluster the table, in the Clustering fields box, enter between one and four field names.
    • For Destination encryption, choose Customer-managed encryption to use a Cloud Key Management Service key to encrypt the table. If you leave the Default setting, BigQuery encrypts the data at rest using a Google-managed key.
  7. Click Create Table.

CLI

Use the bq load command, specify CSV using the --source_format flag, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard. Supply the schema inline, in a schema definition file, or use schema auto-detect.

(Optional) Supply the --location flag and set the value to your location.

Other optional flags include:

  • --allow_jagged_rows: When specified, accepts rows in CSV files that are missing trailing optional columns. The missing values are treated as nulls. If not specified, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false.
  • --allow_quoted_newlines: When specified, allows quoted data sections that contain newline characters in a CSV file. The default value is false.
  • --field_delimiter: The character that indicates the boundary between columns in the data. Both \t and tab are allowed for tab delimiters. The default value is ,.
  • --null_marker: An optional custom string that represents a NULL value in CSV data.
  • --skip_leading_rows: Specifies the number of header rows to skip at the top of the CSV file. The default value is 0.
  • --quote: The quote character to use to enclose records. The default value is ". To indicate no quote character, use an empty string.
  • --max_bad_records: An integer that specifies the maximum number of bad records allowed before the entire job fails. The default value is 0. At most, five errors of any type are returned regardless of the --max_bad_records value.
  • --ignore_unknown_values: When specified, allows and ignores extra, unrecognized values in CSV or JSON data.
  • --autodetect: When specified, enables schema auto-detection for CSV and JSON data.
  • --time_partitioning_type: Enables time-based partitioning on a table and sets the partition type. Currently, the only possible value is DAY which generates one partition per day. This flag is optional when you create a table partitioned on a DATE or TIMESTAMP column.
  • --time_partitioning_expiration: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
  • --time_partitioning_field: The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
  • --require_partition_filter: When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables.
  • --clustering_fields: A comma-separated list of up to four column names used to create a clustered table. This flag can only be used with partitioned tables.
  • --destination_kms_key: The Cloud KMS key for encryption of the table data.

    For more information, see the documentation on partitioned tables, clustered tables, and table encryption.

To load CSV data into BigQuery, enter the following command:

bq --location=location load \
--source_format=format \
dataset.table \
path_to_source \
schema

Where:

  • location is your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • format is CSV.
  • dataset is an existing dataset.
  • table is the name of the table into which you're loading data.
  • path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
  • schema is a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.

Examples:

The following command loads data from gs://mybucket/mydata.csv into a table named mytable in mydataset. The schema is defined in a local schema file named myschema.json.

    bq load \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json

The following command loads data from gs://mybucket/mydata.csv into a table named mytable in mydataset. The schema is defined in a local schema file named myschema.json. The CSV file includes two header rows. If --skip_leading_rows is unspecified, the default behavior is to assume the file does not contain headers.

    bq load \
    --source_format=CSV \
    --skip_leading_rows=2 \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json

The following command loads data from gs://mybucket/mydata.csv into an ingestion-time partitioned table named mytable in mydataset. The schema is defined in a local schema file named myschema.json.

    bq load \
    --source_format=CSV \
    --time_partitioning_type=DAY \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json

The following command loads data from gs://mybucket/mydata.csv into a partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column. The schema is defined in a local schema file named myschema.json.

    bq load \
    --source_format=CSV \
    --time_partitioning_field mytimestamp \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json

The following command loads data from gs://mybucket/mydata.csv into a table named mytable in mydataset. The schema is auto detected.

    bq load \
    --autodetect \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata.csv

The following command loads data from gs://mybucket/mydata.csv into a table named mytable in mydataset. The schema is defined inline in the format field:data_type, field:data_type.

    bq load \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    qtr:STRING,sales:FLOAT,year:STRING

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard. The schema is auto detected.

    bq load \
    --autodetect \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata*.csv

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards. The schema is defined in a local schema file named myschema.json.

    bq load \
    --source_format=CSV \
    mydataset.mytable \
    "gs://mybucket/00/*.csv","gs://mybucket/01/*.csv" \
    ./myschema.json

API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs property must be fully-qualified, in the format gs://bucket/object. Each URI can contain one '*' wildcard character.

  4. Specify the CSV data format by setting the sourceFormat property to CSV.

  5. To check the job status, call jobs.get(job_id), where job_id is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is loaded.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
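
A rough Python sketch of these steps using the Google API discovery client (the project, dataset, table, and URI values are placeholders) might look like the following; it generates its own job ID so the insert can be retried safely:

import time
import uuid

from googleapiclient.discovery import build

bigquery = build("bigquery", "v2")

project_id = "my-project"  # placeholder values below
job_id = "load_csv_{}".format(uuid.uuid4())  # client-generated, retry-safe job ID

job = {
    "jobReference": {"projectId": project_id, "jobId": job_id, "location": "US"},
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/mydata*.csv"],
            "sourceFormat": "CSV",
            "destinationTable": {
                "projectId": project_id,
                "datasetId": "my_dataset",
                "tableId": "my_table",
            },
            "skipLeadingRows": 1,
            "autodetect": True,
        }
    },
}

# jobs.insert is idempotent for a given job ID, so retrying this call is safe.
bigquery.jobs().insert(projectId=project_id, body=job).execute()

# Poll jobs.get until the job is DONE, then inspect errorResult and errors.
status = bigquery.jobs().get(projectId=project_id, jobId=job_id).execute()["status"]
while status["state"] != "DONE":
    time.sleep(1)
    status = bigquery.jobs().get(projectId=project_id, jobId=job_id).execute()["status"]
if "errorResult" in status:
    raise RuntimeError(status["errorResult"])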

C#

Before trying this sample, follow the C# setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery C# API reference documentation.


using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsCsv
{
    public void LoadTableGcsCsv(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.csv";
        var dataset = client.GetDataset(datasetId);
        var schema = new TableSchemaBuilder {
            { "name", BigQueryDbType.String },
            { "post_abbr", BigQueryDbType.String }
        }.Build();
        var destinationTableRef = dataset.GetTableReference(
            tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            // The source format defaults to CSV; line below is optional.
            SourceFormat = FileFormat.Csv,
            SkipLeadingRows = 1
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI, destination: destinationTableRef,
            schema: schema, options: jobOptions);
        loadJob.PollUntilCompleted();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine(
            $"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import "cloud.google.com/go/bigquery"
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.csv")
gcsRef.SkipLeadingRows = 1
gcsRef.Schema = bigquery.Schema{
	{Name: "name", Type: bigquery.StringFieldType},
	{Name: "post_abbr", Type: bigquery.StringFieldType},
}
loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
loader.WriteDisposition = bigquery.WriteEmpty

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}

if status.Err() != nil {
	return fmt.Errorf("job completed with error: %v", status.Err())
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Java API reference documentation.
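
// Note: this snippet assumes `table` is a com.google.cloud.bigquery.Table
// obtained from the BigQuery client (for example, bigquery.getTable(datasetName, tableName)),
// and `sourceUri` is the Cloud Storage URI of the CSV file, such as
// "gs://cloud-samples-data/bigquery/us-states/us-states.csv".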

Job job = table.load(FormatOptions.csv(), sourceUri);
// Wait for the job to complete
try {
  Job completedJob =
      job.waitFor(
          RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
          RetryOption.totalTimeout(Duration.ofMinutes(3)));
  if (completedJob != null && completedJob.getStatus().getError() == null) {
    // Job completed successfully
  } else {
    // Handle error case
  }
} catch (InterruptedException e) {
  // Handle interrupted wait
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the CSV file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.csv
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.csv';

async function loadCSVFromGCS() {
  // Imports a GCS file into a table with manually defined schema.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'CSV',
    skipLeadingRows: 1,
    schema: {
      fields: [
        {name: 'name', type: 'STRING'},
        {name: 'post_abbr', type: 'STRING'},
      ],
    },
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId  = 'The Google project ID';
// $datasetId  = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv';
$schema = [
    'fields' => [
        ['name' => 'name', 'type' => 'string'],
        ['name' => 'post_abbr', 'type' => 'string']
    ]
];
$loadConfig = $table->loadFromStorage($gcsUri)->schema($schema)->skipLeadingRows(1);
$job = $table->runJob($loadConfig);
// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});
// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

Use the Client.load_table_from_uri() method to load data from a CSV file in Cloud Storage. Supply an explicit schema definition by setting the LoadJobConfig.schema property to a list of SchemaField objects.

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("post_abbr", "STRING"),
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"

load_job = client.load_table_from_uri(
    uri, dataset_ref.table("us_states"), job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(dataset_ref.table("us_states"))
print("Loaded {} rows.".format(destination_table.num_rows))

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Ruby API reference documentation.

require "google/cloud/bigquery"

def load_table_gcs_csv dataset_id = "your_dataset_id"
  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  gcs_uri  = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
  table_id = "us_states"

  load_job = dataset.load_job table_id, gcs_uri, skip_leading: 1 do |schema|
    schema.string "name"
    schema.string "post_abbr"
  end
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done!  # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table(table_id)
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end

Appending to or overwriting a table with CSV data

You can load additional data into a table either from source files or by appending query results.

In the console and the classic BigQuery web UI, you use the Write preference option to specify what action to take when you load data from a source file or from a query result.

You have the following options when you load additional data into a table:

  • Write if empty (classic web UI option: Write if empty; CLI flag: none; BigQuery API property: WRITE_EMPTY): Writes the data only if the table is empty.
  • Append to table (classic web UI option: Append to table; CLI flag: --noreplace or --replace=false, the default if --[no]replace is unspecified; BigQuery API property: WRITE_APPEND): Appends the data to the end of the table. This is the default behavior.
  • Overwrite table (classic web UI option: Overwrite table; CLI flag: --replace or --replace=true; BigQuery API property: WRITE_TRUNCATE): Erases all existing data in a table before writing the new data.
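
In the Python client library, for example, these options correspond to the WriteDisposition constants set on the load job configuration (a brief illustrative sketch):

from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
# Choose one of the write preferences described above:
job_config.write_disposition = bigquery.WriteDisposition.WRITE_EMPTY       # Write if empty
# job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND    # Append to table (default)
# job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  # Overwrite table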

If you load data into an existing table, the load job can append the data or overwrite the table.

You can append or overwrite a table by:

  • Using the GCP Console or the classic web UI
  • Using the CLI's bq load command
  • Calling the jobs.insert API method and configuring a load job
  • Using the client libraries

Console

  1. Open the BigQuery web UI in the GCP Console.

  2. In the navigation panel, in the Resources section, expand your project and select a dataset.

  3. On the right side of the window, in the details panel, click Create table. The process for appending and overwriting data in a load job is the same as the process for creating a table in a load job.

  4. On the Create table page, in the Source section:

    • For Create table from, select Cloud Storage.

    • In the source field, browse to or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.

    • For File format, select CSV.

  5. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset.

    • In the Table name field, enter the name of the table you're appending or overwriting in BigQuery.

    • Verify that Table type is set to Native table.

  6. In the Schema section, for Auto detect, check Schema and input parameters to enable schema auto detection. Alternatively, you can manually enter the schema definition by:

    • Enabling Edit as text and entering the table schema as a JSON array.

    • Using Add field to manually input the schema.

  7. For Partition and cluster settings, leave the default values. You cannot convert a table to a partitioned or clustered table by appending or overwriting it, and the GCP Console does not support appending to or overwriting partitioned or clustered tables in a load job.

  8. Click Advanced options.

    • For Write preference, choose Append to table or Overwrite table.
    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job results in an invalid error and fails.
    • For Unknown values, check Ignore unknown values to ignore any values in a row that are not present in the table's schema.
    • For Field delimiter, choose the character that separates the cells in your CSV file: Comma, Tab, Pipe, or Custom. If you choose Custom, enter the delimiter in the Custom field delimiter box. The default value is Comma.
    • For Header rows to skip, enter the number of header rows to skip at the top of the CSV file. The default value is 0.
    • For Quoted newlines, check Allow quoted newlines to allow quoted data sections that contain newline characters in a CSV file. The default value is false.
    • For Jagged rows, check Allow jagged rows to accept rows in CSV files that are missing trailing optional columns. The missing values are treated as nulls. If unchecked, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false.
    • For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.

  9. Click Create table.

Classic UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for appending and overwriting data in a load job is the same as the process for creating a table in a load job.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.
    • For File format, select CSV.
  4. On the Create Table page, in the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're appending or overwriting.
    • Verify that Table type is set to Native table.
  5. In the Schema section, enter the schema definition.

    • For CSV files, you can check the Auto-detect option to enable schema auto-detection.

    • You can also enter schema information manually by:

      • Clicking Edit as text and entering the table schema as a JSON array:

      • Using Add Field to manually input the schema:

  6. In the Options section:

    • For Field delimiter, choose the character that separates the cells in your CSV file: Comma, Tab, Pipe, or Other. If you choose Other, enter the delimiter in the Custom field delimiter box. The default value is Comma.
    • For Header rows to skip, enter the number of header rows to skip at the top of the CSV file. The default value is 0.
    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job results in an invalid error and fails.
    • For Allow quoted newlines, check the box to allow quoted data sections that contain newline characters in a CSV file. The default value is false.
    • For Allow jagged rows, check the box to accept rows in CSV files that are missing trailing optional columns. The missing values are treated as nulls. If unchecked, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false.
    • For Ignore unknown values, check the box to ignore any values in a row that are not present in the table's schema.
    • For Write preference, choose Append to table or Overwrite table.
    • Leave the default values for Partitioning Type, Partitioning Field, Require partition filter, and Clustering Fields. You cannot convert a table to a partitioned or clustered table by appending or overwriting it, and the web UI does not support appending to or overwriting partitioned or clustered tables in a load job.
    • For Destination encryption, choose Customer-managed encryption to use a Cloud Key Management Service key to encrypt the table. If you leave the Default setting, BigQuery encrypts the data at rest using a Google-managed key.
  7. Click Create Table.

CLI

Use the bq load command, specify CSV using the --source_format flag, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.

Supply the schema inline, in a schema definition file, or use schema auto-detect.

Specify the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data.

It is possible to modify the table's schema when you append or overwrite it. For more information on supported schema changes during a load operation, see Modifying table schemas.

(Optional) Supply the --location flag and set the value to your location.

Other optional flags include:

  • --allow_jagged_rows: When specified, accepts rows in CSV files that are missing trailing optional columns. The missing values are treated as nulls. If not specified, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false.
  • --allow_quoted_newlines: When specified, allows quoted data sections that contain newline characters in a CSV file. The default value is false.
  • --field_delimiter: The character that indicates the boundary between columns in the data. Both \t and tab are allowed for tab delimiters. The default value is ,.
  • --null_marker: An optional custom string that represents a NULL value in CSV data.
  • --skip_leading_rows: Specifies the number of header rows to skip at the top of the CSV file. The default value is 0.
  • --quote: The quote character to use to enclose records. The default value is ". To indicate no quote character, use an empty string.
  • --max_bad_records: An integer that specifies the maximum number of bad records allowed before the entire job fails. The default value is 0. At most, five errors of any type are returned regardless of the --max_bad_records value.
  • --ignore_unknown_values: When specified, allows and ignores extra, unrecognized values in CSV or JSON data.
  • --autodetect: When specified, enables schema auto-detection for CSV and JSON data.
  • --destination_kms_key: The Cloud KMS key for encryption of the table data.
To append to or overwrite a table with CSV data, enter the following command:

bq --location=location load \
--[no]replace \
--source_format=format \
dataset.table \
path_to_source \
schema

where:

  • location is your location. The --location flag is optional. You can set a default value for the location using the .bigqueryrc file.
  • format is CSV.
  • dataset is an existing dataset.
  • table is the name of the table into which you're loading data.
  • path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
  • schema is a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.

Examples:

The following command loads data from gs://mybucket/mydata.csv and overwrites a table named mytable in mydataset. The schema is defined using schema auto-detection.

    bq load \
    --autodetect \
    --replace \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata.csv

The following command loads data from gs://mybucket/mydata.csv and appends data to a table named mytable in mydataset. The schema is defined using a JSON schema file — myschema.json.

    bq load \
    --noreplace \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json

API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs property must be fully-qualified, in the format gs://bucket/object. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported.

  4. Specify the data format by setting the configuration.load.sourceFormat property to CSV.

  5. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.
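
For illustration, the relevant part of the jobs.insert request body might look like this sketch (placeholder project, dataset, table, and URI values):

job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/mydata.csv"],
            "sourceFormat": "CSV",
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "my_dataset",
                "tableId": "my_table",
            },
            # WRITE_TRUNCATE overwrites the table; WRITE_APPEND appends to it.
            "writeDisposition": "WRITE_TRUNCATE",
        }
    }
}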

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import "cloud.google.com/go/bigquery"
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.csv")
gcsRef.SourceFormat = bigquery.CSV
gcsRef.AutoDetect = true
gcsRef.SkipLeadingRows = 1
loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
loader.WriteDisposition = bigquery.WriteTruncate

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}

if status.Err() != nil {
	return fmt.Errorf("job completed with error: %v", status.Err())
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.

To replace the rows in an existing table, set the writeDisposition value in the metadata parameter to 'WRITE_TRUNCATE'.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the CSV file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.csv
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.csv';

async function loadCSVFromGCSTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'CSV',
    skipLeadingRows: 1,
    schema: {
      fields: [
        {name: 'name', type: 'STRING'},
        {name: 'post_abbr', type: 'STRING'},
      ],
    },
    // Set the write disposition to overwrite existing table data.
    writeDisposition: 'WRITE_TRUNCATE',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);
  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableId = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv';
$loadConfig = $table->loadFromStorage($gcsUri)->skipLeadingRows(1)->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

To replace the rows in an existing table, set the LoadJobConfig.write_disposition property to the WriteDisposition constant WRITE_TRUNCATE.

# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
load_job = client.load_table_from_uri(
    uri, table_ref, job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(table_ref)
print("Loaded {} rows.".format(destination_table.num_rows))

CSV options

To change how BigQuery parses CSV data, specify additional options in the console, the classic UI, the CLI, or the API.

For more information on the CSV format, see RFC 4180.

  • Field delimiter (console option: Field delimiter, with Comma, Tab, Pipe, or Custom; classic UI option: Field delimiter, with Comma, Tab, Pipe, or Other; CLI flag: -F or --field_delimiter; BigQuery API property: fieldDelimiter): (Optional) The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. BigQuery converts the string to ISO-8859-1 encoding, and uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (',').
  • Header rows (console option: Header rows to skip; classic UI option: Header rows to skip; CLI flag: --skip_leading_rows; BigQuery API property: skipLeadingRows): (Optional) An integer indicating the number of header rows in the source data.
  • Number of bad records allowed (console option: Number of errors allowed; classic UI option: Number of errors allowed; CLI flag: --max_bad_records; BigQuery API property: maxBadRecords): (Optional) The maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an invalid error is returned in the job result. The default value is 0, which requires that all records are valid.
  • Newline characters (console option: Allow quoted newlines; classic UI option: Allow quoted newlines; CLI flag: --allow_quoted_newlines; BigQuery API property: allowQuotedNewlines): (Optional) Indicates whether to allow quoted data sections that contain newline characters in a CSV file. The default value is false.
  • Custom null values (console option: none; classic UI option: none; CLI flag: --null_marker; BigQuery API property: nullMarker): (Optional) Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string. If you set this property to a custom value, BigQuery throws an error if an empty string is present for all data types except for STRING and BYTE. For STRING and BYTE columns, BigQuery interprets the empty string as an empty value.
  • Trailing optional columns (console option: Allow jagged rows; classic UI option: Allow jagged rows; CLI flag: --allow_jagged_rows; BigQuery API property: allowJaggedRows): (Optional) Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false. Only applicable to CSV, ignored for other formats.
  • Unknown values (console option: Ignore unknown values; classic UI option: Ignore unknown values; CLI flag: --ignore_unknown_values; BigQuery API property: ignoreUnknownValues): (Optional) Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false. The sourceFormat property determines what BigQuery treats as an extra value: for CSV, trailing columns; for JSON, named values that don't match any column names.
  • Quote (console option: none; classic UI option: none; CLI flag: --quote; BigQuery API property: quote): (Optional) The value that is used to quote data sections in a CSV file. BigQuery converts the string to ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data in its raw, binary state. The default value is a double-quote ('"'). If your data does not contain quoted sections, set the property value to an empty string. If your data contains quoted newline characters, you must also set the allowQuotedNewlines property to true.
  • Encoding (console option: none; classic UI option: none; CLI flag: -E or --encoding; BigQuery API property: encoding): (Optional) The character encoding of the data. The supported values are UTF-8 or ISO-8859-1. The default value is UTF-8. BigQuery decodes the data after the raw, binary data has been split using the values of the quote and fieldDelimiter properties.
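
As a rough guide, these options map onto properties of LoadJobConfig in the Python client library (a sketch with illustrative values only):

from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.field_delimiter = "|"           # Field delimiter
job_config.skip_leading_rows = 1           # Header rows
job_config.max_bad_records = 10            # Number of bad records allowed
job_config.allow_quoted_newlines = True    # Newline characters
job_config.null_marker = "\\N"             # Custom null values
job_config.allow_jagged_rows = True        # Trailing optional columns
job_config.ignore_unknown_values = True    # Unknown values
job_config.quote_character = '"'           # Quote
job_config.encoding = "ISO-8859-1"         # Encoding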