Loading JSON Data from Google Cloud Storage

When you load newline delimited JSON data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

For information about loading JSON data from a local file, see Loading data into BigQuery from a local data source.

Limitations

When you load JSON data from Cloud Storage into BigQuery, note the following:

  • JSON data must be newline delimited
  • If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.
  • BigQuery does not support maps or dictionaries in JSON. For example, "product_categories": {"my_product": 40.0} is not valid, but "product_categories": {"column1": "my_product", "column2": 40.0} is valid.
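
For example, a map from a product name to a price can be restructured as explicit columns before loading. The following sketch, with illustrative field names, shows two valid newline delimited records, one JSON object per line:

{"id":"1","product_categories":{"column1":"my_product","column2":40.0}}
{"id":"2","product_categories":{"column1":"other_product","column2":12.5}}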

Required permissions

When you load data into BigQuery, you need project or dataset-level permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need access to the bucket that contains your data.

BigQuery permissions

When you are loading data into BigQuery from Cloud Storage, you must be granted the bigquery.dataOwner or bigquery.dataEditor role at the project level or at the dataset level. Both roles grant users and groups permission to load data into a new table or to append to or overwrite an existing table.

Granting the roles at the project level gives the user or group permission to load data into tables in every dataset in the project. Granting the roles at the dataset level gives the user or group the ability to load data only into tables in that dataset.

For more information on configuring dataset access, see Assigning access controls to datasets. For more information on IAM roles in BigQuery, see Access Control.

Cloud Storage permissions

To load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions at the project level or on that individual bucket. If you are using a URI wildcard, you must also have storage.objects.list permissions.

The predefined IAM role storage.objectViewer can be granted to provide both storage.objects.get and storage.objects.list permissions.
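
For example, you could grant the role on an individual bucket with the gsutil command-line tool. This is a sketch; the bucket name and user email are placeholders:

gsutil iam ch user:jane@example.com:objectViewer gs://mybucket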

Loading JSON data into a new table

To load newline delimited JSON data from Google Cloud Storage into a new BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported.
    • For File format, select JSON (Newline Delimited).
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  5. In the Schema section, enter the schema definition.

    • For JSON files, you can check the Auto-detect option to enable schema auto-detection.

    • You can also enter schema information manually by:

      • Clicking Edit as text and entering the table schema as a JSON array.

      • Using Add Field to manually input the schema.

  6. Select applicable items in the Options section and then click Create Table. For information on the available options, see JSON options.

Command-line

Use the bq load command, specify NEWLINE_DELIMITED_JSON as the source_format, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs or a URI containing a wildcard:

bq load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]

where:

  • [FORMAT] is NEWLINE_DELIMITED_JSON.
  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
  • [SCHEMA] is a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.

In addition, you can add flags for JSON options that allow you to control how BigQuery parses your data.

Examples:

  • The following command loads data from gs://mybucket/mydata.json into a table named mytable in mydataset. The schema is defined in a local schema file named myschema.json:

    bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/mydata.json ./myschema.json
    
  • The following command loads data from gs://mybucket/mydata.json into a table named mytable in mydataset. The schema is defined inline in the format [FIELD]:[DATA_TYPE], [FIELD]:[DATA_TYPE]:

    bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/mydata.json qtr:STRING,sales:FLOAT,year:STRING
    

    Note: When you specify the schema on the command line, you cannot include a RECORD (STRUCT) type, you cannot include a field description, and you cannot specify the field mode. All field modes default to NULLABLE. To include field descriptions, modes, and RECORD types, supply a JSON schema file instead (see the sketch after these examples).

  • The following command loads data from gs://mybucket/mydata.json into a table named mytable in mydataset. The schema is defined using schema auto-detection:

    bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/mydata.json
    
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard, and the schema is defined using schema auto-detection:

    bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/mydata*.json
    
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs, and the schema is defined using schema auto-detection:

    bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable "gs://mybucket/myfile.json,gs://mybucket/myfile2.json"
    
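
As noted above, field descriptions, modes, and RECORD types require a JSON schema file. A minimal sketch of such a file, with illustrative field names:

[
    {
        "name": "qtr",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "quarter"
    },
    {
        "name": "sales",
        "type": "FLOAT",
        "mode": "NULLABLE",
        "description": "total sales"
    }
]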

API

Set the following properties to load newline delimited JSON data using the API.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. Each URI can contain one '*' wildcard character.

  3. Specify the JSON data format by setting the configuration.load.sourceFormat property to NEWLINE_DELIMITED_JSON.

  4. To check the job status, call jobs.get([JOB_ID]), where [JOB_ID] is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is added.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert() to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
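
The following sketch shows these steps using the google-cloud-bigquery Python client library; the bucket, dataset, and table names are placeholders:

import uuid

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True  # or supply an explicit schema instead

# Generate a unique job ID so a retry after a network failure is idempotent.
load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.json",
    client.dataset("mydataset").table("mytable"),
    job_id="load_json_{}".format(uuid.uuid4()),
    job_config=job_config,
)

# result() polls the job and raises if status.errorResult is present.
load_job.result()
print("Loaded {} rows.".format(load_job.output_rows))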

Loading nested and repeated JSON data

BigQuery supports loading nested and repeated data from source formats that support object-based schemas, such as JSON, Avro, and Cloud Datastore backup files.

One JSON object, including any nested/repeated fields, must appear on each line.

Example

The following example shows sample nested/repeated data. This table contains information about people. It consists of the following fields:

  • id
  • first_name
  • last_name
  • dob (date of birth)
  • addresses (a nested and repeated field)
    • addresses.status (current or previous)
    • addresses.address
    • addresses.city
    • addresses.state
    • addresses.zip
    • addresses.numberOfYears (years at the address)

The JSON data file would look like the following. Notice that the addresses field contains an array of values (indicated by [ ]).

{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
{"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}

The schema for this table would look like the following:

[
    {
        "name": "id",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "first_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "last_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "dob",
        "type": "DATE",
        "mode": "NULLABLE"
    },
    {
        "name": "addresses",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {
                "name": "status",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "address",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "city",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "state",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "zip",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "numberOfYears",
                "type": "STRING",
                "mode": "NULLABLE"
            }
        ]
    }
]

For information on specifying a nested and repeated schema, see Specifying Nested and Repeated Fields.
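
If you load this data programmatically, the same nested and repeated schema can be expressed with the Python client library's SchemaField type. A sketch; the bucket, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# The addresses field is a REPEATED RECORD whose subfields are listed
# in the fields argument.
schema = [
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("first_name", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("last_name", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("dob", "DATE", mode="NULLABLE"),
    bigquery.SchemaField("addresses", "RECORD", mode="REPEATED", fields=[
        bigquery.SchemaField("status", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("address", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("city", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("state", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("zip", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("numberOfYears", "STRING", mode="NULLABLE"),
    ]),
]

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = schema

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.json",
    client.dataset("mydataset").table("mytable"),
    job_config=job_config,
)
load_job.result()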

Appending to or overwriting a table with JSON data

You can load additional data into a table either from source files or by appending query results. If the schema of the data does not match the schema of the destination table or partition, you can update the schema when you append to it or overwrite it.

If you update the schema when appending data, BigQuery allows you to:

  • Add new fields
  • Relax REQUIRED fields to NULLABLE

If you are overwriting a table, the schema is always overwritten. Schema updates are not restricted when you overwrite a table.

In the BigQuery web UI, you use the Write preference option to specify what action to take when you load data from a source file or from a query result. The CLI and API include the following options:

  • Write if empty. CLI flag: none. API property: WRITE_EMPTY. Writes the data only if the table is empty.
  • Append to table. CLI flag: --noreplace or --replace=false; if --[no]replace is unspecified, the default is append. API property: WRITE_APPEND. (Default) Appends the data to the end of the table.
  • Overwrite table. CLI flag: --replace or --replace=true. API property: WRITE_TRUNCATE. Erases all existing data in a table before writing the new data.

To load JSON data from Google Cloud Storage and to append to or overwrite a BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the UI, but wildcards are supported.
    • For File format, select JSON (Newline Delimited).
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're appending or overwriting.
    • Verify that Table type is set to Native table.
  5. In the Schema section, enter the schema definition. To update the schema, you can add new fields or change (relax) fields from REQUIRED to NULLABLE.

    • For JSON files, you can check the Auto-detect option to enable schema auto-detection.

    • You can also enter schema information manually by:

      • Clicking Edit as text and entering the table schema as a JSON array.

      • Using Add Field to manually input the schema.

  6. In the Options section, for Write preference, choose Write if empty, Append to table, or Overwrite table.

  7. Click Create Table.

Command-line

Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data.

When appending or overwriting a table, you can use the --schema_update_option flag to update the schema of the destination table with the schema of the new data. The following options can be used with the --schema_update_option flag:

  • ALLOW_FIELD_ADDITION: Adds new fields to the schema; new fields cannot be REQUIRED
  • ALLOW_FIELD_RELAXATION: Relaxes required fields to nullable; repeat this option to specify a list of values

bq load --[no]replace --source_format=NEWLINE_DELIMITED_JSON [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]

where:

  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
  • [SCHEMA] is a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.

In addition, you can add flags for JSON options that allow you to control how BigQuery parses your data.

Examples:

  • The following command loads data from gs://mybucket/mydata.json and overwrites a table named mytable in mydataset. The schema is defined using schema auto-detection:

    bq load --autodetect --replace --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/mydata.json
    
  • The following command loads data from gs://mybucket/mydata.json and appends data to a table named mytable in mydataset. The schema is defined using a local JSON schema file named myschema.json:

    bq load --noreplace --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/mydata.json ./myschema.json
    
  • The following command loads data from gs://mybucket/mydata.json and appends data to a table named mytable in mydataset. A local JSON schema file named myschema.json is used. The schema definition contains new fields not present in the destination table:

    bq load --noreplace --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/mydata.json ./myschema.json
    
  • The following command loads data from gs://mybucket/mydata.json and appends data to a table named mytable in mydataset. A local JSON schema file named myschema.json is used. The schema definition changes (relaxes) two REQUIRED fields to NULLABLE:

    bq load --noreplace --schema_update_option=ALLOW_FIELD_RELAXATION --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/mydata.json ./myschema.json
    

API

Set the following properties to load JSON data using the API.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported when loading JSON data from Google Cloud Storage.

  3. Specify the data format by setting the configuration.load.sourceFormat property to NEWLINE_DELIMITED_JSON.

  4. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE, WRITE_APPEND, or WRITE_EMPTY.

  5. To update the schema in your load job, set the configuration.load.schemaUpdateOptions property to ALLOW_FIELD_ADDITION or ALLOW_FIELD_RELAXATION.
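
A sketch of these steps with the Python client library, overwriting a table; the bucket, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True
# Overwrite the table. Use WRITE_APPEND to append instead, and when
# appending you can also allow schema changes:
# job_config.schema_update_options = [
#     bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
#     bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
# ]
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.json",
    client.dataset("mydataset").table("mytable"),
    job_config=job_config,
)
load_job.result()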

JSON options

To change how BigQuery parses JSON data, specify additional options in the UI, CLI, or API.

  • Number of bad records allowed. Web UI option: Number of errors allowed. CLI flag: --max_bad_records. API property: maxBadRecords. (Optional) The maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an invalid error is returned in the job result. The default value is 0, which requires that all records are valid.
  • Unknown values. Web UI option: Ignore unknown values. CLI flag: --ignore_unknown_values. API property: ignoreUnknownValues. (Optional) Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false. The sourceFormat property determines what BigQuery treats as an extra value: for CSV, trailing columns; for JSON, named values that don't match any column names.
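
These options map directly onto the Python client library's LoadJobConfig. A sketch; the bucket, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True
job_config.max_bad_records = 10          # maxBadRecords: tolerate up to 10 bad records
job_config.ignore_unknown_values = True  # ignoreUnknownValues: skip unmatched values

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.json",
    client.dataset("mydataset").table("mytable"),
    job_config=job_config,
)
load_job.result()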
