Loading JSON data from Cloud Storage

Loading JSON files from Cloud Storage

When you load newline delimited JSON data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi- regional location as the Cloud Storage bucket.

The newline delimited JSON format is the same format as the JSON Lines format.

For information about loading JSON data from a local file, see Loading data into BigQuery from a local data source.

Limitations

When you load JSON data from Cloud Storage into BigQuery, note the following:

  • JSON data must be newline delimited. Each JSON object must be on a separate line in the file.
  • If you use gzip compression BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.
  • You cannot include both compressed and uncompressed files in the same load job.
  • BigQuery does not support maps or dictionaries in JSON. For example, "product_categories": {"my_product": 40.0} is not valid, but "product_categories": {"column1": "my_product" , "column2": 40.0} is valid.
  • When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
  • When you load JSON or CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon (:) separator.

Required permissions

When you load data into BigQuery, you need permissions to run a load job and permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need permissions to access to the bucket that contains your data.

BigQuery permissions

At a minimum, the following permissions are required to load data into BigQuery. These permissions are required if you are loading data into a new table or partition, or if you are appending or overwriting a table or partition.

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.jobs.create

The following predefined Cloud IAM roles include both bigquery.tables.create and bigquery.tables.updateData permissions:

  • bigquery.dataEditor
  • bigquery.dataOwner
  • bigquery.admin

The following predefined Cloud IAM roles include bigquery.jobs.create permissions:

  • bigquery.user
  • bigquery.jobUser
  • bigquery.admin

In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create and update tables in the dataset via a load job.

For more information on Cloud IAM roles and permissions in BigQuery, see Access control.

Cloud Storage permissions

In order to load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions. If you are using a URI wildcard, you must also have storage.objects.list permissions.

The predefined Cloud IAM role storage.objectViewer can be granted to provide both storage.objects.get and storage.objects.list permissions.

Loading JSON data into a new table

You can load newline delimited JSON data from Cloud Storage into a new BigQuery table by:

  • Using the GCP Console or the classic web UI
  • Using the CLI's bq load command
  • Calling the jobs.insert API method and configuring a load job
  • Using the client libraries

To load JSON data from Cloud Storage into a new BigQuery table:

Console

  1. Open the BigQuery web UI in the GCP Console.
    Go to the GCP Console

  2. In the navigation panel, in the Resources section, expand your project and select a dataset.

  3. On the right side of the window, in the details panel, click Create table. The process for loading data is the same as the process for creating an empty table.

    Create table

  4. On the Create table page, in the Source section:

    • For Create table from, select Cloud Storage.

    • In the source field, browse to or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the GCP Console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.

      Select file

    • For File format, select JSON (Newline delimited).

  5. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset.

      View dataset

    • Verify that Table type is set to Native table.

    • In the Table name field, enter the name of the table you're creating in BigQuery.

  6. In the Schema section, for Auto detect, check Schema and input parameters to enable schema auto detection. Alternatively, you can manually enter the schema definition by:

    • Enabling Edit as text and entering the table schema as a JSON array.

      Add schema as JSON array

    • Using Add field to manually input the schema.

      Add schema definition using the Add Field button

  7. (Optional) To partition the table, choose your options in the Partition and cluster settings:

    • To create a partitioned table, click No partitioning, select Partition by field and choose a DATE or TIMESTAMP column. This option is unavailable if your schema does not include a DATE or TIMESTAMP column.
    • To create an ingestion-time partitioned table, click No partitioning and select Partition by ingestion time.
  8. (Optional) For Partitioning filter, click the Require partition filter box to require users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables. This option is unavailable if No partitioning is selected.

  9. (Optional) To cluster the table, in the Clustering order box, enter between one and four field names. Currently, clustering is supported only for partitioned tables.

  10. (Optional) Click Advanced options.

    • For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job will result in an invalid message and fail.
    • For Unknown values, check Ignore unknown values to ignore any values in a row that are not present in the table's schema.
    • For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
  11. Click Create table.

Classic UI

  1. Go to the BigQuery web UI.
    Go to the BigQuery web UI

  2. In the navigation panel, hover on a dataset, click the down arrow icon down arrow icon image, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • Click Create from source.
    • For Location, select Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.
    • For File format, select JSON (Newline delimited).
  4. In the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  5. In the Schema section, for Auto detect, check Schema and input parameters to enable schema auto detection. Alternatively, you can manually enter the schema definition by:

    • Clicking Edit as text and entering the table schema as a JSON array.

      Add schema as JSON array

    • Using Add Field to manually input the schema:

      Add schema using add fields

  6. (Optional) In the Options section:

    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job will result in an invalid message and fail.
    • For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
    • To partition the table:
      • For Partitioning Type, click None and choose Day.
      • For Partitioning Field:
      • To create a partitioned table, choose a DATE or TIMESTAMP column. This option is unavailable if your schema does not include a DATE or TIMESTAMP column.
      • To create an ingestion-time partitioned table, leave the default value: _PARTITIONTIME.
      • Click the Require partition filter box to require users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables. This option is unavailable if Partitioning type is set to None.
    • To cluster the table, in the Clustering fields box, enter between one and four field names.
    • For Destination encryption, choose Customer-managed encryption to use a Cloud Key Management Service key to encrypt the table. If you leave the Default setting, BigQuery encrypts the data at rest using a Google-managed key.
  7. Click Create Table.

CLI

Use the bq load command, specify NEWLINE_DELIMITED_JSON using the --source_format flag, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard. Supply the schema inline, in a schema definition file, or use schema auto-detect.

(Optional) Supply the --location flag and set the value to your location.

Other optional flags include:

  • --max_bad_records: An integer that specifies the maximum number of bad records allowed before the entire job fails. The default value is 0. At most, five errors of any type are returned regardless of the --max_bad_records value.
  • --ignore_unknown_values: When specified, allows and ignores extra, unrecognized values in CSV or JSON data.
  • --autodetect: When specified, enable schema auto-detection for CSV and JSON data.
  • --quote: The quote character to use to enclose records. The default value is ". To indicate no quote character, use an empty string.
  • --time_partitioning_type: Enables time-based partitioning on a table and sets the partition type. Currently, the only possible value is DAY which generates one partition per day. This flag is optional when you create a table partitioned on a DATE or TIMESTAMP column.
  • --time_partitioning_expiration: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
  • --time_partitioning_field: The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
  • --require_partition_filter: When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables.
  • --clustering_fields: A comma-separated list of up to four column names used to create a clustered table. This flag can only be used with partitioned tables.
  • --destination_kms_key: The Cloud KMS key for encryption of the table data.

    For more information on partitioned tables, see:

    For more information on clustered tables, see:

    For more information on table encryption, see:

To load JSON data into BigQuery, enter the following command:

bq --location=location load \
--source_format=format \
dataset.table \
path_to_source \
schema

Where:

  • location is your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • format is NEWLINE_DELIMITED_JSON.
  • dataset is an existing dataset.
  • table is the name of the table into which you're loading data.
  • path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
  • schema is a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.

Examples:

The following command loads data from gs://mybucket/mydata.json into a table named mytable in mydataset. The schema is defined in a local schema file named myschema.json.

    bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/mydata.json \
    ./myschema.json

The following command loads data from gs://mybucket/mydata.json into an ingestion-time partitioned table named mytable in mydataset. The schema is defined in a local schema file named myschema.json.

    bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    --time_partitioning_type=DAY \
    mydataset.mytable \
    gs://mybucket/mydata.json \
    ./myschema.json

The following command loads data from gs://mybucket/mydata.json into a partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column. The schema is defined in a local schema file named myschema.json.

    bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    --time_partitioning_field mytimestamp \
    mydataset.mytable \
    gs://mybucket/mydata.json \
    ./myschema.json

The following command loads data from gs://mybucket/mydata.json into a table named mytable in mydataset. The schema is auto detected.

    bq load \
    --autodetect \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/mydata.json

The following command loads data from gs://mybucket/mydata.json into a table named mytable in mydataset. The schema is defined inline in the format field:data_type, field:data_type.

    bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/mydata.json \
    qtr:STRING,sales:FLOAT,year:STRING

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard. The schema is auto detected.

    bq load \
    --autodetect \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/mydata*.json

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma- separated list of Cloud Storage URIs with wildcards. The schema is defined in a local schema file named myschema.json.

    bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    "gs://mybucket/00/*.json","gs://mybucket/01/*.json" \
    ./myschema.json

API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs property must be fully-qualified, in the format gs://bucket/object. Each URI can contain one '*' wildcard character.

  4. Specify the JSON data format by setting the sourceFormat property to NEWLINE_DELIMITED_JSON.

  5. To check the job status, call jobs.get(job_id*), where job_id is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is loaded.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations will succeed.

C#

Before trying this sample, follow the C# setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery C# API reference documentation .

Use the BigQueryClient.CreateLoadJob() method to start a load job from Cloud Storage. To use newline-delimited JSON, create a CreateLoadJobOptions object and set its SourceFormat property to FileFormat.NewlineDelimitedJson.


using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsJson
{
    public void LoadTableGcsJson(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.json";
        var dataset = client.GetDataset(datasetId);
        var schema = new TableSchemaBuilder {
            { "name", BigQueryDbType.String },
            { "post_abbr", BigQueryDbType.String }
        }.Build();
        TableReference destinationTableRef = dataset.GetTableReference(
            tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            SourceFormat = FileFormat.NewlineDelimitedJson
        };
        // Create and run job
        BigQueryJob loadJob = client.CreateLoadJob(
            sourceUri: gcsURI, destination: destinationTableRef,
            schema: schema, options: jobOptions);
        loadJob.PollUntilCompleted();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine(
            $"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery Go API reference documentation .

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import "cloud.google.com/go/bigquery"
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.json")
gcsRef.SourceFormat = bigquery.JSON
gcsRef.Schema = bigquery.Schema{
	{Name: "name", Type: bigquery.StringFieldType},
	{Name: "post_abbr", Type: bigquery.StringFieldType},
}
loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
loader.WriteDisposition = bigquery.WriteEmpty

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}

if status.Err() != nil {
	return fmt.Errorf("Job completed with error: %v", status.Err())
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery Java API reference documentation .

Use the LoadJobConfiguration.builder(tableId, sourceUri) method to start a load job from Cloud Storage. To use newline-delimited JSON, use the LoadJobConfiguration.setFormatOptions(FormatOptions.json()).

String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.json";
TableId tableId = TableId.of(datasetName, tableName);
// Table field definition
Field[] fields =
    new Field[] {
      Field.of("name", LegacySQLTypeName.STRING),
      Field.of("post_abbr", LegacySQLTypeName.STRING)
    };
// Table schema definition
Schema schema = Schema.of(fields);
LoadJobConfiguration configuration =
    LoadJobConfiguration.builder(tableId, sourceUri)
        .setFormatOptions(FormatOptions.json())
        .setCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .setSchema(schema)
        .build();
// Load the table
Job loadJob = bigquery.create(JobInfo.of(configuration));
loadJob = loadJob.waitFor();
// Check the table
System.out.println("State: " + loadJob.getStatus().getState());
return ((StandardTableDefinition) bigquery.getTable(tableId).getDefinition()).getNumRows();

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery Node.js API reference documentation .

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the json file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.json
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.json';

async function loadJSONFromGCS() {
  // Imports a GCS file into a table with manually defined schema.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_dataset";
  // const tableId = "my_table";

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'NEWLINE_DELIMITED_JSON',
    schema: {
      fields: [
        {name: 'name', type: 'STRING'},
        {name: 'post_abbr', type: 'STRING'},
      ],
    },
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);
  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery PHP API reference documentation .

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId  = 'The Google project ID';
// $datasetId  = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.json';
$schema = [
    'fields' => [
        ['name' => 'name', 'type' => 'string'],
        ['name' => 'post_abbr', 'type' => 'string']
    ]
];
$loadConfig = $table->loadFromStorage($gcsUri)->schema($schema)->sourceFormat('NEWLINE_DELIMITED_JSON');
$job = $table->runJob($loadConfig);
// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});
// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery Python API reference documentation .

Use the Client.load_table_from_uri() method to start a load job from Cloud Storage. To use newline-delimited JSON, set the LoadJobConfig.source_format property to the string NEWLINE_DELIMITED_JSON and pass the job config as the job_config argument to the load_table_from_uri() method.

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("post_abbr", "STRING"),
]
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.json"

load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table("us_states"),
    location="US",  # Location must match that of the destination dataset.
    job_config=job_config,
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(dataset_ref.table("us_states"))
print("Loaded {} rows.".format(destination_table.num_rows))

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery Ruby API reference documentation .

Use the Dataset.load_job() method to start a load job from Cloud Storage. To use newline-delimited JSON, set the format parameter to "json".

require "google/cloud/bigquery"

def load_table_gcs_json dataset_id = "your_dataset_id"
  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  gcs_uri  = "gs://cloud-samples-data/bigquery/us-states/us-states.json"
  table_id = "us_states"

  load_job = dataset.load_job table_id, gcs_uri, format: "json" do |schema|
    schema.string "name"
    schema.string "post_abbr"
  end
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done!  # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table(table_id)
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end

Loading nested and repeated JSON data

BigQuery supports loading nested and repeated data from source formats that support object-based schemas, such as JSON, Avro, ORC, Parquet, Cloud Firestore, and Cloud Datastore.

One JSON object, including any nested/repeated fields, must appear on each line.

The following example shows sample nested/repeated data. This table contains information about people. It consists of the following fields:

  • id
  • first_name
  • last_name
  • dob (date of birth)
  • addresses (a nested and repeated field)
    • addresses.status (current or previous)
    • addresses.address
    • addresses.city
    • addresses.state
    • addresses.zip
    • addresses.numberOfYears (years at the address)

The JSON data file would look like the following. Notice that the address field contains an array of values (indicated by [ ]).

{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
{"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}

The schema for this table would look like the following:

[
    {
        "name": "id",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "first_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "last_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "dob",
        "type": "DATE",
        "mode": "NULLABLE"
    },
    {
        "name": "addresses",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {
                "name": "status",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "address",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "city",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "state",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "zip",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "numberOfYears",
                "type": "STRING",
                "mode": "NULLABLE"
            }
        ]
    }
]

For information on specifying a nested and repeated schema, see Specifying nested and repeated fields.

Appending to or overwriting a table with JSON data

You can load additional data into a table either from source files or by appending query results.

In the console and the classic BigQuery web UI, you use the Write preference option to specify what action to take when you load data from a source file or from a query result.

You have the following options when you load additional data into a table:

Console option Classic web UI option CLI flag BigQuery API property Description
Write if empty Write if empty None WRITE_EMPTY Writes the data only if the table is empty.
Append to table Append to table --noreplace or --replace=false; if --[no]replace is unspecified, the default is append WRITE_APPEND (Default) Appends the data to the end of the table.
Overwrite table Overwrite table --replace or --replace=true WRITE_TRUNCATE Erases all existing data in a table before writing the new data.

If you load data into an existing table, the load job can append the data or overwrite the table.

You can append or overwrite a table by:

  • Using the GCP Console or the classic web UI
  • Using the CLI's bq load command
  • Calling the jobs.insert API method and configuring a load job
  • Using the client libraries

Console

  1. Open the BigQuery web UI in the GCP Console.
    Go to the GCP Console

  2. In the navigation panel, in the Resources section, expand your project and select a dataset.

  3. On the right side of the window, in the details panel, click Create table. The process for appending and overwriting data in a load job is the same as the process for creating a table in a load job.

    Create table

  4. On the Create table page, in the Source section:

    • For Create table from, select Cloud Storage.

    • In the source field, browse to or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.

      Select file

    • For File format, select JSON (Newline delimited).

  5. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset.

      Select dataset

    • In the Table name field, enter the name of the table you're appending or overwriting in BigQuery.

    • Verify that Table type is set to Native table.

  6. In the Schema section, for Auto detect, check Schema and input parameters to enable schema auto detection. Alternatively, you can manually enter the schema definition by:

    • Enabling Edit as text and entering the table schema as a JSON array.

      Add schema as JSON array

    • Using Add field to manually input the schema.

      Add schema definition using the Add Field button

  7. For Partition and cluster settings, leave the default values. You cannot convert a table to a partitioned or clustered table by appending or overwriting it, and the GCP Console does not support appending to or overwriting partitioned or clustered tables in a load job.

  8. Click Advanced options.

    • For Write preference, choose Append to table or Overwrite table.
    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job will result in an invalid message and fail.
    • For Unknown values, check Ignore unknown values to ignore any values in a row that are not present in the table's schema.
    • For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.

      Overwrite table

  9. Click Create table.

Classic UI

  1. Go to the BigQuery web UI.
    Go to the BigQuery web UI

  2. In the navigation panel, hover on a dataset, click the down arrow icon down arrow icon image, and click Create new table. The process for appending and overwriting data in a load job is the same as the process for creating a table in a load job.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.
    • For File format, select JSON (Newline Delimited).
  4. On the Create Table page, in the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're appending or overwriting.
    • Verify that Table type is set to Native table.
  5. In the Schema section, enter the schema definition.

    • For JSON files, you can check the Auto-detect option to enable schema auto-detection.

      auto detect link

    • You can also enter schema information manually by:

      • Clicking Edit as text and entering the table schema as a JSON array:

        Add schema as JSON array

      • Using Add Field to manually input the schema:

        Add schema using add fields

  6. In the Options section:

    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job will result in an invalid message and fail.
    • For Write preference, choose Append to table or Overwrite table.
    • Leave the default values for Partitioning Type, Partitioning Field, Require partition filter, and Clustering Fields. You cannot convert a table to a partitioned or clustered table by appending or overwriting it, and the web UI does not support appending to or overwriting partitioned or clustered tables in a load job.
    • For Destination encryption, choose Customer-managed encryption to use a Cloud Key Management Service key to encrypt the table. If you leave the Default setting, BigQuery encrypts the data at rest using a Google-managed key.
  7. Click Create Table.

CLI

Use the bq load command, specify NEWLINE_DELIMITED_JSON using the --source_format flag, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.

Supply the schema inline, in a schema definition file, or use schema auto-detect.

Specify the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data.

It is possible to modify the table's schema when you append or overwrite it. For more information on supported schema changes during a load operation, see Modifying table schemas.

(Optional) Supply the --location flag and set the value to your location.

Other optional flags include:

  • --max_bad_records: An integer that specifies the maximum number of bad records allowed before the entire job fails. The default value is 0. At most, five errors of any type are returned regardless of the --max_bad_records value.
  • --ignore_unknown_values: When specified, allows and ignores extra, unrecognized values in CSV or JSON data.
  • --autodetect: When specified, enable schema auto-detection for CSV and JSON data.
  • --destination_kms_key: The Cloud KMS key for encryption of the table data.
bq --location=location load \
--[no]replace \
--source_format=format \
dataset.table \
path_to_source \
schema

where:

  • location is your location. The --location flag is optional. You can set a default value for the location using the .bigqueryrc file.
  • format is NEWLINE_DELIMITED_JSON.
  • dataset is an existing dataset.
  • table is the name of the table into which you're loading data.
  • path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
  • schema is a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.

Examples:

The following command loads data from gs://mybucket/mydata.json and overwrites a table named mytable in mydataset. The schema is defined using schema auto-detection.

    bq load \
    --autodetect \
    --replace \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/mydata.json

The following command loads data from gs://mybucket/mydata.json and appends data to a table named mytable in mydataset. The schema is defined using a JSON schema file — myschema.json.

    bq load \
    --noreplace \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/mydata.json \
    ./myschema.json

API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs property must be fully-qualified, in the format gs://bucket/object. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported.

  4. Specify the data format by setting the configuration.load.sourceFormat property to NEWLINE_DELIMITED_JSON.

  5. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery Go API reference documentation .

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import "cloud.google.com/go/bigquery"
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.json")
gcsRef.SourceFormat = bigquery.JSON
gcsRef.AutoDetect = true
loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
loader.WriteDisposition = bigquery.WriteTruncate

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}

if status.Err() != nil {
	return fmt.Errorf("job completed with error: %v", status.Err())
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery Node.js API reference documentation .

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the JSON file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.json
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.json';

async function loadJSONFromGCSTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_dataset";
  // const tableId = "my_table";

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'NEWLINE_DELIMITED_JSON',
    schema: {
      fields: [
        {name: 'name', type: 'STRING'},
        {name: 'post_abbr', type: 'STRING'},
      ],
    },
    // Set the write disposition to overwrite existing table data.
    writeDisposition: 'WRITE_TRUNCATE',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);
  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery PHP API reference documentation .

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableID = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.json';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('NEWLINE_DELIMITED_JSON')->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

To replace the rows in an existing table, set the LoadJobConfig.write_disposition property to the string WRITE_TRUNCATE.

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery Python API reference documentation .

# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.json"
load_job = client.load_table_from_uri(
    uri, table_ref, job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(table_ref)
print("Loaded {} rows.".format(destination_table.num_rows))

Ruby

To replace the rows in an existing table, set the write parameter of Table.load_job() to "WRITE_TRUNCATE".

Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries . For more information, see the BigQuery Ruby API reference documentation .

require "google/cloud/bigquery"

def load_table_gcs_json_truncate(
    dataset_id = "your_dataset_id",
    table_id   = "your_table_id"
  )
  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  gcs_uri  = "gs://cloud-samples-data/bigquery/us-states/us-states.json"

  load_job = dataset.load_job table_id,
                              gcs_uri,
                              format: "json",
                              write:  "truncate"
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done!  # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table(table_id)
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end

JSON options

To change how BigQuery parses JSON data, specify additional options in the console, the classic UI, the CLI, the API, or the client libraries.

JSON option Console option Classic UI option CLI flag BigQuery API property Description
Number of bad records allowed Number of errors allowed Number of errors allowed --max_bad_records maxBadRecords (Optional) The maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an invalid error is returned in the job result. The default value is 0, which requires that all records are valid.
Unknown values Ignore unknown values Ignore unknown values --ignore_unknown_values ignoreUnknownValues (Optional) Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false. The sourceFormat property determines what BigQuery treats as an extra value: CSV: Trailing columns JSON: Named values that don't match any column names.
Oliko tästä sivusta apua? Kerro mielipiteesi

Palautteen aihe:

Tämä sivu
BigQuery
Tarvitsetko apua? Siirry tukisivullemme.