Query Building Guide

Introduction

This section describes how to configure your anomaly detection queries.

It can be helpful to also read the concepts page, which defines some of the terms we use throughout this page.

At a high level, Timeseries Insights API anomaly detection queries check whether any slices in your dataset show unexpected values at one specific point in time, called the detection time, which is set by the top-level QueryDataSetRequest.detectionTime parameter.

Internally, there are three conceptual steps that take place to resolve an anomaly detection query:

  • The first step in evaluating the query is to split the dataset into slices, which will be analyzed and marked as anomalies individually based on their value at the detection time. This step is controlled by the QueryDataSetRequest.slicingParams.

  • The second step is to compute, for each slice, a time series that is the result of grouping by time and aggregating the events in that slice. The time series contains multiple data points up to the detection time and its purpose is to predict the expected value at the detection time. This step is controlled by the QueryDataSetRequest.timeseriesParams.

  • The third and final step is to evaluate each slice by forecasting the value at the detection time with the time series built in the previous step and, based on the desired sensitivity, classifying any slices that fall outside the expected bounds as anomalies. This process is configured by the QueryDataSetRequest.forecastParams.

Prerequisites

Follow the setup instructions in our Quickstart guide to ensure you can run all commands in this guide.

We will show how varying each parameter impacts the returned results by querying the public demo dataset that we already preloaded with data from the GDELT Project.

Base query

Save the following query as query.json in your working directory:

{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s"
  }
}

Issue the query with the following command:

gcurl -X POST -d @query.json https://timeseriesinsights.googleapis.com/v1/projects/timeseries-insights-api-demo/datasets/webnlp-201901-202104:query

The previous query represents the simplest anomaly detection request you can send, in which only the required parameters are populated.

We set the detectionTime to "2019-04-15T00:00:00Z", indicating the moment in time we want to analyze; an anomaly is raised if the actual value differs from the expected value.

The value at the detection time is given by accumulating and aggregating the events that occurred within the [detectionTime, detectionTime + granularity] time interval. In the previous query we specify this granularity by setting timeseriesParams.granularity to "86400s".

Besides the time accumulation window given by the granularity, we also have to consider how events are grouped into slices. This is specified via the slicingParams.dimensionNames parameter, which in our example is set to one dimension: "EntityLOCATION". As a result, we will have to analyze as many slices as there are unique values for the "EntityLOCATION" dimension in the events in our dataset.

Lastly, we must specify how many data points to include in the time series used to forecast the value at the detection time. Including more points can capture more seasonal and trend patterns. To do this, set the timeseriesParams.forecastHistory field to the duration we want to cover, in this case "2592000s" (30 days).

Slicing specification

How the events are grouped into slices is controlled at query time by the SlicingParams.dimensionNames parameter, which in our previous example was set to ["EntityLOCATION"].

We can specify additional dimensions by doing the following:

{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION", "EntityORGANIZATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s"
  }
}

Rerun the query with the added dimension and notice the change in the ForecastSlice.dimensions field.

If you are interested in analyzing only a subset of the dataset, you can filter on the values of certain dimensions by setting SlicingParams.pinnedDimensions. For example, we can restrict the analysis to events (and thus slices) that have the value "AP" for the dimension "EntityORGANIZATION":

{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"],
    "pinnedDimensions": [
      {
        "name": "EntityORGANIZATION",
        "stringVal": "AP"
      }
    ]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s"
  }
}

Time series configuration

This section describes how you can control the process of aggregating events into time series for each slice.

Granularity

The granularity of the time series is the fixed time interval between consecutive time series points. Consider two things when choosing the granularity:

  • Lowering the granularity allows the forecasting algorithms to capture more seasonal patterns (if present in the time series). For example, the current granularity we have set in query.json is daily ("granularity": "86400s"), which does not allow us to capture hourly seasonality patterns. If interested in hourly seasonality, we should lower the granularity to at most 3600s.

  • The granularity implicitly represents the width of the detection aggregation window. Running with a finer granularity will cause us to evaluate a shorter time interval (and fewer events that fall in that interval).

As mentioned before, the granularity can be set through the TimeseriesParams.granularity parameter.

Let's lower it to 3600s in query.json:

{
  "returnTimeseries": true,
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "3600s"
  },
  "forecastParams": {
    "sensitivity": 0.05,
    "noiseThreshold": 20
  }
}

Rerun the query and observe that the returned time series contains points separated by exactly 1 hour, which is the granularity we set. Experiment with other granularities and observe the effect.

Sample partial output:

...
{
  "time": "2019-04-13T01:00:00Z",
  "value": 15
},
{
  "time": "2019-04-13T02:00:00Z",
  "value": 23
},
{
  "time": "2019-04-13T03:00:00Z",
  "value": 22
},
...

Metric

The value of each point in the time series, including the value at the detection time, is calculated by aggregating a numerical dimension that must be present in the events accumulated for that point. That dimension is called the metric and is configured by the TimeseriesParams.metric field.

If no metric is specified, then the value of each time series point is set as the number of events accumulated at that point.
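
For example, if our dataset had a numerical dimension named "pageViews" (a hypothetical name used here only for illustration; it is not a dimension in the public demo dataset), the following query would aggregate that dimension instead of counting events:

{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s",
    "metric": "pageViews"
  }
}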

Forecast history

How far back in time we go, and therefore how many points we include in the time series, is given by the TimeseriesParams.forecastHistory parameter.

In our initial example:

  • We have set forecastHistory to "2592000s" (30 days), which tells us to fetch events between March 16, 2019 and April 15, 2019 and form a time series spanning those two dates.
  • Each data point in the time series will cover a period of time equal to the granularity (so 86400s - 1 day in our example).
  • The time series will have 30 points, each holding the aggregated value for the events in that calendar day. Assuming no explicit numerical dimension was specified as the metric, the first time series point will have as its value the total number of events on March 16, 2019, the second point the total for March 17, 2019, and so on.

You can see the historical values of each time series in the ForecastResult.history field (it is only returned if QueryDataSetRequest.returnTimeseries is set to true in the request).

NOTE: We model a time series in the Timeseries proto as a collection of points (defined as TimeseriesPoint), sorted by time, where each point's value (TimeseriesPoint.value) is the aggregated value for the slice during the [TimeseriesPoint.time, TimeseriesPoint.time + granularity] time interval.

By increasing TimeseriesParams.forecastHistory and including a longer history, you can capture certain seasonality patterns and potentially increase the accuracy of the forecast, since it is based on more data. For example, if the data has monthly seasonality patterns, a forecastHistory of 30 days won't capture them; in this scenario, we should increase the forecastHistory so that the analyzed time series contains multiple monthly cycles (for monthly patterns to be detected, we recommend a forecastHistory of at least 100 days).

From our initial example, try reducing the granularity to 3600s (1 hour) and increasing TimeseriesParams.forecastHistory to 8640000s (100 days). This increases the number of points in the history time series to 2400, making it more granular and covering a longer period of time.
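
As a sketch, the resulting query.json would look like this (returnTimeseries is set to true so that you can inspect the longer history in ForecastResult.history):

{
  "returnTimeseries": true,
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "8640000s",
    "granularity": "3600s"
  }
}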

Minimum density

If you want to ignore predictions for sparse time series that do not contain enough data points, set the TimeseriesParams.minDensity parameter to a value between 0 and 100. It represents the percentage of points, between the start and end time of the series, that the time series must contain in order to be accepted for analysis.
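
For example, the following variation of our base query skips any slice whose time series has data points for less than 10% of its expected timestamps (the value 10 here is only an illustration, not a recommendation):

{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s",
    "minDensity": 10
  }
}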

Forecasting configuration

To compute the expected value during the tested time interval, we employ forecasting algorithms, which, based on the historical values for the slice, predict what the value should be during the tested interval.

Sensitivity tuning

Based on the forecasted bounds and the actual value at the detection time for a slice, we classify it as an anomaly or not according to the sensitivity parameters. These parameters are:

  • ForecastParams.sensitivity specifies how sensitive the anomaly detection process should be. The lower the value, the less sensitive it is and the fewer anomalies returned. The sensitivity must be in the (0.0, 1.0] interval.
  • ForecastParams.noiseThreshold gives a minimum threshold for the variation between the forecasted and actual values at the detection time for its associated slice to be marked as an anomaly. Set this higher if dealing with noisy low-volume slices.

As a baseline, try setting the sensitivity parameters to the following:

{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s"
  },
  "forecastParams": {
    "sensitivity": 1.0,
    "noiseThreshold": 0.0
  }
}

This is the most sensitive query that we can run, and it will mark as anomalies all slices that have the detectionPointActual outside the [detectionPointForecastLowerBound, detectionPointForecastUpperBound] bounds. While this might seem like what we want (and could be useful in some applications), in practice we will be interested in ignoring most of these anomalies, because the false positive rate will be high.

Running this baseline query will result in ~2500 anomalies:

$ gcurl -X POST -d @query.json https://timeseriesinsights.googleapis.com/v1/projects/timeseries-insights-api-demo/datasets/webnlp-201901-202104:query \
    | grep "detectionPointActual" | wc -l

2520

If we decrease the sensitivity to 0.1, the number of anomalies drops to ~1800. This gets rid of slices with smaller variances, but we still end up with a high number of slices classified as anomalies.
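
The query.json for this intermediate step keeps the baseline noiseThreshold of 0.0 and only lowers the sensitivity:

{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s"
  },
  "forecastParams": {
    "sensitivity": 0.1,
    "noiseThreshold": 0.0
  }
}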

Assuming we are only interested in the biggest spikes in the news during that day, we can also increase noiseThreshold, which filters out low-volume slices. Let's set it to 250.0, which gives us our final query.json:

{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s"
  },
  "forecastParams": {
    "sensitivity": 0.1,
    "noiseThreshold": 250.0
  }
}

Running the previous query yields only 3 anomalies, all associated with the Notre Dame fire, which was the most mentioned event in the news on April 15, 2019:

$ gcurl -X POST -d @query.json https://timeseriesinsights.googleapis.com/v1/projects/timeseries-insights-api-demo/datasets/webnlp-201901-202104:query \
    | grep stringVal

            "stringVal": "Ile de la Cite"
            "stringVal": "Notre Dame"
            "stringVal": "Seine"

Horizon

The time horizon, ForecastParams.horizonTime, specifies how far into the future, starting from the detection time, we should predict values based on the historical time series.

If QueryDataSetRequest.returnTimeseries is set to true, the forecasted time series is returned in ForecastResult.forecast for each slice and contains the predicted values between detectionTime + granularity and detectionTime + granularity + horizonTime.
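
As a sketch, the following variation of our base query requests a one-week forecast past the detection time (the "604800s" horizon value is only an illustration) and sets returnTimeseries so the forecasted points are included in the response:

{
  "returnTimeseries": true,
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s"
  },
  "forecastParams": {
    "horizonTime": "604800s"
  }
}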

Query performance and resource usage

Besides reducing noise and filtering out non-interesting slices, reducing the anomaly detection sensitivity can also significantly reduce query latency and resource usage.

Increasing the noiseThreshold generally results in the most noticeable reduction in anomalous slices, so it is also the main sensitivity parameter to increase when you want to improve query performance.

You can experiment with different sensitivity parameters to see how they impact the performance of the query. For example:

$ cat query.json
{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s"
  },
  "forecastParams": {
    "sensitivity": 1.0,
    "noiseThreshold": 0.0
  }
}

$ time gcurl -X POST -d @query.json https://timeseriesinsights.googleapis.com/v1/projects/timeseries-insights-api-demo/datasets/webnlp-201901-202104:query \
    | grep stringVal | wc -l

2580

real 0m26.765s
user 0m0.618s
sys  0m0.132s

We can see that the most sensitive parameters lead to our query taking ~26.7s. Reducing the sensitivity brings the query time down to ~3.4s:

$ cat query.json
{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "2592000s",
    "granularity": "86400s"
  },
  "forecastParams": {
    "sensitivity": 0.1,
    "noiseThreshold": 250.0
  }
}

$ time gcurl -X POST -d @query.json https://timeseriesinsights.googleapis.com/v1/projects/timeseries-insights-api-demo/datasets/webnlp-201901-202104:query \
    | grep stringVal | wc -l

3

real 0m3.412s
user 0m0.681s
sys  0m0.047s

What's next