Specifying Nested and Repeated Columns

BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, denormalize your data and take advantage of nested and repeated columns. Nested and repeated columns can maintain relationships without the performance impact of preserving a relational (normalized) schema.

You can specify nested or nested and repeated data in the UI or a JSON schema file. To specify nested or nested and repeated columns, you use the RECORD (STRUCT) data type.

Specifying nested and repeated columns

BigQuery supports loading nested and repeated data from source formats that support object-based schemas, such as JSON, Avro, and Cloud Datastore backup files.

For example, a relational database used to track library books would likely keep all author information in a separate table. A key such as author_id would be used to link the book to the authors.

In BigQuery, you can preserve the relationship between book and author without creating a separate author table. Instead, you create an author column, and you nest fields within it such as the author's first name, last name, date of birth, and so on. If a book has multiple authors, you can make the nested author column repeated.

To create a column with nested and repeated data, set the data type of the column to RECORD. A RECORD is stored as a STRUCT and can be accessed as a STRUCT in standard SQL. A STRUCT is a container of ordered fields each with a type (required) and a name (optional). To make the column nested, you add child fields to the RECORD (parent). To make the column repeated, change the mode to REPEATED.

Example

The following example shows sample nested and repeated data. This table contains information about people. It consists of the following fields:

  • id
  • first_name
  • last_name
  • dob (date of birth)
  • addresses (a nested and repeated field)
    • addresses.status (current or previous)
    • addresses.address
    • addresses.city
    • addresses.state
    • addresses.zip
    • addresses.numberOfYears (years at the address)

The JSON data file would look like the following. Notice that the addresses column contains an array of values (indicated by [ ]). The multiple addresses in the array are the repeated data. The multiple fields within each address are the nested data.

{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
{"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}

The schema for this table would look like the following:

[
    {
        "name": "id",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "first_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "last_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "dob",
        "type": "DATE",
        "mode": "NULLABLE"
    },
    {
        "name": "addresses",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {
                "name": "status",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "address",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "city",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "state",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "zip",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "numberOfYears",
                "type": "STRING",
                "mode": "NULLABLE"
            }
        ]
    }
]

Specifying nested and repeated fields

Web UI

To specify the nested and repeated addresses column in the BigQuery web UI:

  1. Go to the BigQuery web UI.

    Go to the BigQuery web UI

  2. Click the down arrow icon down arrow icon next to your dataset name in the navigation and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create table page:

    • For Source Data, click Create from source.
    • For Destination Table, choose your dataset and enter the table name in the Destination table name field.
    • For Schema:

      • In the Name field, type addresses.
      • For Type, select RECORD.
      • For Mode, choose REPEATED.
      • To the right of RECORD, click the plus icon plus icon to add a nested field.
      • In the Name field, type status. Notice the field is pre-populated with addresses. indicating it's a nested field.
      • For Type, choose STRING.
      • For Mode, leave the value set to NULLABLE.
      • Repeat these steps to add address (NULLABLE STRING), city (NULLABLE STRING), state (NULLABLE STRING), zip (NULLABLE STRING), and numberOfYears (NULLABLE STRING).

        Nested field in the UI

    • Alternatively, click Edit as Text and specify the schema as a JSON array.

When you examine the schema in the BigQuery web UI, the addresses field looks like the following:

Nested field on the Schema tab

Command-line

To specify the nested and repeated addresses column in a JSON schema file, enter the following using a text editor:

[
    {
        "name": "id",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "first_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "last_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "dob",
        "type": "DATE",
        "mode": "NULLABLE"
    },
    {
        "name": "addresses",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {
                "name": "status",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "address",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "city",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "state",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "zip",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "numberOfYears",
                "type": "STRING",
                "mode": "NULLABLE"
            }
        ]
    }
]

After creating your JSON schema file, you can supply it via the command line.

Python

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_ref = client.dataset('my_dataset')

schema = [
    bigquery.SchemaField('id', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('first_name', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('last_name', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('dob', 'DATE', mode='NULLABLE'),
    bigquery.SchemaField('addresses', 'RECORD', mode='REPEATED', fields=[
        bigquery.SchemaField('status', 'STRING', mode='NULLABLE'),
        bigquery.SchemaField('address', 'STRING', mode='NULLABLE'),
        bigquery.SchemaField('city', 'STRING', mode='NULLABLE'),
        bigquery.SchemaField('state', 'STRING', mode='NULLABLE'),
        bigquery.SchemaField('zip', 'STRING', mode='NULLABLE'),
        bigquery.SchemaField('numberOfYears', 'STRING', mode='NULLABLE'),
    ]),
]
table_ref = dataset_ref.table('my_table')
table = bigquery.Table(table_ref, schema=schema)
table = client.create_table(table)  # API request

print('Created table {}'.format(table.full_table_id))
Cette page vous a-t-elle été utile ? Évaluez-la :

Envoyer des commentaires concernant…