Caching queries and rebuilding PDTs with datagroups

Looker reduces the load on your database and improves performance by using cached results of prior queries when available and permitted by your caching policy. In addition, you can create complex queries as persistent derived tables (PDTs), which store their results to simplify later queries.

Datagroups are useful for coordinating your database's ETL (extract, transform, and load) schedule with Looker's caching policy and PDT rebuilding schedule. You can use a datagroup to specify the rebuild trigger for PDTs based on when new data is added to your database. In the same datagroup, you can specify a maximum cache age for Explore queries as a fail-safe in case the PDT rebuild is not triggered for some reason. That way, the datagroup's failure mode is to query the database rather than serve stale data from the Looker cache.

Alternatively, you can use a datagroup to decouple the PDT rebuilding trigger from your maximum cache age. This can be useful if you have an Explore based on data that updates very frequently and that Explore is joined to a PDT that is rebuilt less frequently. In this case, you would want your query cache to reset more frequently than your PDT is rebuilt.

For details on query caching, see the How Looker uses cached queries section on this documentation page.

Defining a datagroup

A datagroup is defined with the datagroup parameter, either in a model file or in its own LookML file. You can define multiple datagroups if you want different caching and PDT rebuild policies for different Explores and/or PDTs in your project.

The datagroup parameter can have the following subparameters:

  • label — Specifies an optional label for the datagroup.
  • description — Specifies an optional description for the datagroup that can be used to explain the datagroup's purpose and mechanism.
  • max_cache_age — Specifies a string that defines a time period. When the age of a query's cache exceeds the time period, Looker invalidates the cache. The next time the query is issued, Looker sends the query to the database for fresh results.
  • sql_trigger — Specifies a SQL query that returns one value. Looker periodically runs the query, and when the returned value differs from the previous result, the datagroup is triggered.
  • interval_trigger — Specifies a time schedule for triggering the datagroup, such as "24 hours".

See the datagroup documentation page for more information on these parameters.
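
As an illustration, here is a minimal sketch that combines these subparameters; the datagroup name and values are hypothetical:

datagroup: daily_datagroup {
  label: "Daily rebuild"
  description: "Triggers once a day and caps the Explore query cache at 6 hours"
  interval_trigger: "24 hours"
  max_cache_age: "6 hours"
}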

At a minimum, a datagroup must have one of the max_cache_age, sql_trigger, or interval_trigger parameters.

A datagroup cannot have both the sql_trigger and interval_trigger parameters. If you define a datagroup with both, the datagroup uses the interval_trigger value and ignores the sql_trigger value, since sql_trigger requires Looker to consume database resources to run the trigger query, while interval_trigger does not.

Here is an example of a datagroup with a sql_trigger that rebuilds its PDTs every day: the query SELECT DATE(NOW()) returns a new value when the date changes, which triggers the datagroup. In addition, max_cache_age is set to clear the query cache every two hours, in case any Explores join PDTs to other data that refreshes more frequently than once a day.

datagroup: customers_datagroup {
  sql_trigger: SELECT DATE(NOW());;
  max_cache_age: "2 hours"
}

Once you define a datagroup, you can assign it to Explores and PDTs, as described in the following sections.

Using a datagroup to specify a rebuild trigger for PDTs

To define a PDT rebuilding trigger using datagroups, create a datagroup parameter with either the sql_trigger or the interval_trigger subparameter. Then assign the datagroup to individual PDTs using the datagroup_trigger subparameter in the PDT's derived_table definition. If you use datagroup_trigger for your PDT, you don't need to specify any other persistence strategy for the derived table. If you specify multiple persistence strategies for a PDT, you will get a warning in the Looker IDE, and only the datagroup_trigger will be used.

Here is an example of a PDT definition that uses the customers_datagroup datagroup. This definition also adds indexes on both customer_id and first_order_date. For more information about defining PDTs, see the Derived tables in Looker documentation page.

view: customer_order_facts {
  derived_table: {
    sql: ... ;;
    datagroup_trigger: customers_datagroup
    indexes: ["customer_id", "first_order_date"]
  }
}

If you have cascading PDTs (PDTs that depend on other PDTs), be careful not to specify incompatible datagroup caching policies.
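
One way to keep cascading PDTs compatible is to assign the same datagroup to each PDT in the chain. The following sketch is illustrative; the view names and SQL are hypothetical:

view: customer_order_facts {
  derived_table: {
    sql: ... ;;
    datagroup_trigger: customers_datagroup
  }
}

view: customer_order_summary {
  derived_table: {
    # This PDT selects from the PDT above, so it shares the same datagroup
    # to keep both tables on the same rebuild schedule.
    sql: SELECT ... FROM ${customer_order_facts.SQL_TABLE_NAME} ;;
    datagroup_trigger: customers_datagroup
  }
}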

For connections that use user attributes to specify connection parameters, you must create a separate connection using the PDT override fields if you want to do either of the following:

  • Use PDTs in your model
  • Use a datagroup to define a PDT rebuild trigger

Without the PDT overrides, you can still use a datagroup with max_cache_age to define the caching policy for Explores.

See the Derived tables in Looker documentation page for more information about how datagroups work with PDTs.

Using a datagroup to specify query cache reset for Explores

When a datagroup is triggered, the Looker regenerator will rebuild the PDTs that use that datagroup as a persistence strategy. Once the datagroup's PDTs are rebuilt, Looker will clear the cache for Explores that use the datagroup's rebuilt PDTs. You can add the max_cache_age parameter to your datagroup definition if you want to customize a query cache reset schedule for the datagroup. The max_cache_age parameter lets you clear the query cache on a specified schedule, in addition to the automatic query cache reset that Looker performs when the datagroup's PDTs are rebuilt.

To define a query caching policy with datagroups, create a datagroup parameter with the max_cache_age subparameter.

To specify a datagroup to use for query cache resets on Explores, use the persist_with parameter:

The following example shows a datagroup named orders_datagroup that is defined in a model file. The datagroup has a sql_trigger parameter, which specifies that the query select max(id) from my_tablename will be used to detect when an ETL has happened. Even if that ETL doesn't happen for a while, the datagroup's max_cache_age specifies that the cached data will be used for a maximum of 24 hours.

The model's persist_with parameter points to the orders_datagroup caching policy, which means it will be the default caching policy for all Explores in the model. However, suppose you don't want to use the model's default caching policy for the customer_facts and customer_background Explores; you can add the persist_with parameter to specify a different caching policy for those two Explores. The orders and order_facts Explores don't have a persist_with parameter, so they will use the model's default caching policy: orders_datagroup.

datagroup: orders_datagroup {
  sql_trigger: SELECT max(id) FROM my_tablename ;;
  max_cache_age: "24 hours"
}

datagroup: customers_datagroup {
  sql_trigger: SELECT max(id) FROM my_other_tablename ;;
}

persist_with: orders_datagroup

explore: orders { ... }

explore: order_facts { ... }

explore: customer_facts {
  persist_with: customers_datagroup
  ...
}

explore: customer_background {
  persist_with: customers_datagroup
  ...
}

If both persist_with and persist_for are specified, you will receive a validation warning and the persist_with value will be used.
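
As a hypothetical illustration of that warning, the following Explore sets both parameters; the persist_for value would be ignored in favor of the datagroup:

explore: orders {
  persist_with: orders_datagroup
  # Ignored: persist_with takes precedence, and the IDE shows a validation warning
  persist_for: "4 hours"
}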

Using a datagroup to trigger scheduled deliveries

Datagroups can also be used to trigger a delivery of a dashboard, a legacy dashboard, or a Look. With this option, Looker will send your data when the datagroup completes, so that the scheduled content is up to date.

Using the Admin panel for datagroups

If you have the Looker admin role, you can use the Admin panel's Datagroups page to view the existing datagroups. You can see the connection, model, and current status of each datagroup and — if specified in the LookML — a label and description for each datagroup. You can also reset the cache for a datagroup, trigger the datagroup, or navigate to the datagroup's LookML.

How Looker uses cached queries

For queries, the caching mechanism in Looker works as follows:

  1. Once a user runs a specific query, the result of that query is cached. Cached results are stored in an encrypted file on your Looker instance.
  2. When a new query is issued, the cache is checked to see if the exact same query was previously run. All fields, filters, and parameters must be the same, including settings such as row limits. If the query is not found, then Looker runs the query against the database to get fresh results (and those results are then cached).

    Context comments do not affect caching. Looker adds a unique comment to the beginning of each SQL query. However, as long as the SQL query itself is the same as a previous query (not including the context comments), Looker will use cached results.

  3. If the query is found in the cache, then Looker checks the caching policy defined in the model to see if the cache is still valid. By default, Looker invalidates cached results after an hour. You can use a persist_for parameter (at the model level or the Explore level) or the more powerful datagroup parameter to specify the caching policy that determines when cached results become invalid and should be ignored; a brief persist_for sketch follows this list. An admin can also invalidate the cached results for a datagroup.

    • If the cache is still valid, then those results are used.
    • If the cache is no longer valid, then Looker runs the query against the database to get fresh query results. (Those new results are also cached.)
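
To make the caching-policy step concrete, here is a minimal, hypothetical sketch of persist_for set at the model level and overridden at the Explore level:

persist_for: "1 hour"   # model default: cached results expire after one hour

explore: flights {
  # This Explore overrides the model default with a shorter cache window
  persist_for: "15 minutes"
}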

Seeing whether a query was returned from cache

In an Explore window, you can determine whether a query has been returned from the cache by looking in the upper right corner after running a query.

When a query is returned from the cache, "from cache" is shown. Otherwise, the amount of time to return the query is shown.

Forcing new results to be generated from the database

In an Explore window, you can force new results to be retrieved from the database. Select the Clear Cache & Refresh option from the gear menu, which you'll find in the upper right of the screen after running a query (including merged results queries).

A persistent derived table is normally regenerated based on the PDT's persistence strategy. You can force the derived table to rebuild early if your admin has given you the develop permission and you are viewing an Explore that includes fields from the PDT. Select the Rebuild Derived Tables & Run option from the gear drop-down menu, which you'll find in the upper right of the screen after running a query.

See the Derived tables in Looker documentation page for further details about the Rebuild Derived Tables & Run option.

How long is data stored in the cache?

To specify the amount of time before the cached results become invalid, use the persist_for parameter (for a model or for an Explore) or the max_cache_age parameter (for a datagroup). Here are the different behaviors along the timeline, depending on whether the persist_for or max_cache_age time has expired:

  • Before the persist_for or max_cache_age time expires: If the query is rerun, Looker pulls data from the cache.
  • When the persist_for or max_cache_age time expires: Looker deletes data from the cache, unless you have enabled the Instant Dashboards Looker Labs feature.
  • After the persist_for or max_cache_age time expires: If the query is rerun, Looker pulls the data from the database directly and resets the persist_for or max_cache_age timer.

One key point here is that the data is deleted from the cache when the persist_for or max_cache_age time expires, as long as the Instant Dashboards Looker Labs feature is disabled. (The Instant Dashboards feature requires the cache in order to immediately load cached results into a dashboard.) If you enable Instant Dashboards, data stays in the cache for 30 days, or until the cache storage limits are reached. If the cache reaches the storage limit, data is ejected based on a Least Recently Used (LRU) algorithm, with no guarantee that data with expired persist_for or max_cache_age timers will be deleted all at once.

Minimizing the time your data spends in the cache

Looker requires the disk cache for internal processes, so data will always be written to the cache, even if you set the persist_for and max_cache_age parameters to 0. Once written to the cache, the data will be flagged for deletion but may remain on disk for up to 10 minutes.

However, all customer data that appears in the disk cache is Advanced Encryption Standard (AES) encrypted, and you can minimize the amount of time that data is stored in the cache by making these changes:

  • Disable the Instant Dashboards Looker Labs feature, which requires Looker to store data in cache.
  • For any persist_for parameter (for a model or an Explore) or max_cache_age parameter (for a datagroup), set the value to 0. For Looker instances with Instant Dashboards off, Looker deletes the cache when the persist_for time expires, or when the data reaches the max_cache_age specified in its datagroup. (This is not necessary for the persist_for parameter of persistent derived tables, because persistent derived tables are written to the database itself, and not to the cache.)
  • Set the suggest_persist_for parameter to a small amount of time. The suggest_persist_for value specifies how long Looker should keep filter suggestions in the cache. The filter suggestions are based on a query of the values for the field being filtered. These query results are kept in the cache so Looker can quickly provide suggestions as the user types in the filter text field. The default is to cache the filter suggestions for 6 hours. To minimize the amount of time your data is in the cache, set the suggest_persist_for value to something lower, such as 5 minutes.
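
Putting those settings together, here is a hypothetical sketch of a model tuned to minimize cache residency; the Explore, view, and field names are illustrative:

persist_for: "0 seconds"   # flag cached Explore results for deletion immediately

explore: orders {
  persist_for: "0 seconds"
}

view: orders {
  dimension: status {
    sql: ${TABLE}.status ;;
    # Cache filter suggestions for five minutes instead of the default 6 hours
    suggest_persist_for: "5 minutes"
  }
}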