Nested data in BigQuery (repeated records)

BigQuery supports nested records in tables. Nested records can be a single record or contain repeated values. This page provides an overview of working with BigQuery nested data in Looker.

The advantages of nested records

There are a few advantages to using nested records when you're scanning a distributed dataset:

  • Nested records do not require joins. This means that computations can be faster and scan much less data than if you had to rejoin the extra data each time you query it.
  • Nested structures are essentially pre-joined tables. There is no added expense for the query if you do not reference the nested column, because BigQuery data is stored in columns. If you do reference the nested column, the logic is identical to a colocated join.
  • Nested structures avoid repeating data that would have to be repeated in a wide denormalized table. In other words, for a person who has lived in five cities, a wide denormalized table would contain all their information in five rows (one for each of the cities they have lived in). In a nested structure, the repeated information only takes one row since the array of five cities can be contained in a single row and unnested when needed.

Working with nested records in LookML

The following BigQuery table, persons_living, displays a typical schema that stores example user data, including fullName, age, phoneNumber, and citiesLived along with the datatype and mode of each column. The schema shows that the values in the citiesLived column are repeated, indicating that some users may have lived in multiple cities:

The following example is the LookML for the Explores and views you can create from the previous schema shown. There are three views: persons, persons_cities_lived, and persons_phone_number. The Explore appears identical to an Explore that is written with non-nested tables.

Note: While all the components (views and Explore) are written in one code block in the following example, it is best practice to place views in individual view files and to place Explores and connection: specification in the model file.

-- model file

connection: "bigquery_publicdata_standard_sql"

explore: persons {

  # Repeated nested object
  join: persons_cities_lived {
    view_label: "Persons: Cities Lived:"
    sql: LEFT JOIN UNNEST(persons.citiesLived) as persons_cities_lived ;;
    relationship: one_to_many
  }

  # Non repeated nested object
  join: persons_phone_number {
    view_label: "Persons: Phone:"
    sql: LEFT JOIN UNNEST([${persons.phoneNumber}]) as persons_phone_number ;;
    relationship: one_to_one
  }

}

-- view files

view: persons {

 sql_table_name: bigquery-samples.nested.persons_living ;;

  dimension: id {
    primary_key: yes
    sql: ${TABLE}.fullName ;;
  }

  dimension: fullName {label: "Full Name"}

  dimension: kind {}

  dimension: age {type:number}

  dimension: citiesLived {hidden:yes}

  dimension: phoneNumber {hidden:yes}

  measure: average_age {
    type: average
    sql: ${age} ;;
    drill_fields: [fullName,age]
  }

  measure: count {
    type: count
    drill_fields: [fullName, cities_lived.place_count, age]
  }
}

view: persons_phone_number {

  dimension: areaCode {label: "Area Code"}

  dimension: number {}
}

view: persons_cities_lived {

  dimension: id {
    primary_key: yes
    sql: CONCAT(CAST(${persons.fullName} AS STRING),'|', CAST(${place} AS STRING)) ;;
  }

  dimension: place {}

  dimension: numberOfYears {
    label: "Number Of Years"
    type: number
  }

  measure: place_count {
    type: count
    drill_fields: [place, persons.count]
  }

  measure: total_years {
    type: sum
    sql: ${numberOfYears} ;;
    drill_fields: [persons.fullName, persons.age, place, numberOfYears]
  }

}

Each component for working with nested data in LookML is discussed in greater detail in the following sections:

Views

Each nested record is written as a view. For example, the phoneNumber view simply declares the dimensions that appear in the record:

view: persons_phone_number {

  dimension: areaCode {label: "Area Code"}

  dimension: number {}

}

The persons_cities_lived view is more complex. As shown in the LookML example, you define the dimensions that appear in the record (numberOfYears and place), but you can also define some measures. The measures and drill_fields are defined as usual, as if this data were in its own table. The only real difference is that you declare id as a primary_key so that aggregates are properly calculated.

view: persons_cities_lived {

  dimension: id {
    primary_key: yes
    sql: CONCAT(CAST(${persons.fullName} AS STRING),'|', CAST(${place} AS STRING)) ;;
  }

  dimension: place {}

  dimension: numberOfYears {
    label: "Number Of Years"
    type: number
  }

  measure: place_count {
    type: count
    drill_fields: [place, persons.count]
  }

  measure: total_years {
    type: sum
    sql: ${numberOfYears} ;;
    drill_fields: [persons.fullName, persons.age, place, numberOfYears]
  }

}

Record declarations

In the view that contains the subrecords (in this case persons), you need to declare the records. These will be used when you create the joins. You can hide these LookML fields with the hidden parameter because you won't need them when exploring the data.

view: persons {

  ...
  dimension: citiesLived {
    hidden:yes
    }

  dimension: phoneNumber {
    hidden:yes
    }
  ...

}

Joins

Nested records in BigQuery are arrays of STRUCT elements. Instead of joining with a sql_on parameter, the join relationship is built into the table. In this case, you can use the sql: join parameter so that you can use the UNNEST operator. Other than that difference, unnesting an array of STRUCT elements is exactly like joining a table.

In the case of non-repeated records, you can simply use STRUCT; you can turn that into an array of STRUCT elements by placing it in square brackets. While this may appear strange, there seems be be no performance penalty — and this keeps things clean and simple.

explore: persons {

  # Repeated nested object
  join: persons_cities_lived {
    view_label: "Persons: Cities Lived:"
    sql: LEFT JOIN UNNEST(persons.citiesLived) as persons_cities_lived ;;
    relationship: one_to_many
  }

  # Non repeated nested object
  join: persons_phone_number {
    view_label: "Persons: Phone:"
    sql: LEFT JOIN UNNEST([${persons.phoneNumber}]) as persons_phone_number ;;
    relationship: one_to_one
  }

}

Joins for arrays without unique keys for each row

While it is best to have identifiable natural keys in the data, or surrogate keys created in the ETL process, this is not always possible. For example, you could encounter a situation where some arrays don't have a relative unique key for the row. This is where WITH OFFSET can come in handy in join syntax.

For example, a column representing a person might load multiple times if the person has lived in multiple cities — Chicago, Denver, San Francisco, etc. It can be difficult to create a primary key on the unnested row if a date or other identifiable natural key is not provided to distinguish the person's tenure in each city. This is where WITH OFFSET can provide a relative row number (0,1,2,3) for each unnested row. This approach guarantees a unique key on the unnested row:

explore: persons {

  # Repeated nested Object
  join: persons_cities_lived {
    view_label: "Persons: Cities Lived:"
    sql: LEFT JOIN UNNEST(persons.citiesLived) as persons_cities_lived WITH OFFSET as person_cities_lived_offset;;
    relationship: one_to_many
  }

}

view: persons_cities_lived {

  dimension: id {
    primary_key: yes
    sql: CONCAT(CAST(${persons.fullName} AS STRING),'|', CAST(${offset} AS STRING)) ;;
  }

  dimension: offset {
    type: number
    sql: person_cities_lived_offset;;
  }

}

Simple repeated values

Nested data in BigQuery can also be simple values, such as integers or strings. To unnest arrays of simple repeated values, you can use a similar approach as shown previously, using the UNNEST operator in a join.

The following example unnests a given array of integers, `unresolved_skus`:

explore: impressions {
  join: impressions_unresolved_sku {
    sql: LEFT JOIN UNNEST(unresolved_skus) AS impressions_unresolved_sku ;;
    relationship: one_to_many
  }

}

view: impressions_unresolved_sku {

  dimension: sku {
    type: string
    sql: ${TABLE} ;;
  }

}

The sql parameter for the array of integers, unresolved_skus, is represented as ${TABLE}. This directly references the table of values itself, which is then unnested in the explore.