Tutorial

We use a small dataset provided by Kalev Leetaru to illustrate the Timeseries Insights API. The dataset is derived from The GDELT Project, a global database tracking world events and media coverage. It contains entity mentions found in news URLs during April 2019.

Objectives

  • Learn the data format expected by the Timeseries Insights API.
  • Learn how to create, query, update, and delete datasets.

Costs

There is no cost for using the API during Preview.

Before you begin

Set up a Cloud project and enable the Timeseries Insights API by following Getting Started.

Tutorial dataset

The dataset includes entity annotations of locations, organizations, persons, and more.

The Timeseries Insights API takes inputs in JSON format. A sample Event for this dataset is:

{
  "groupId":"-6180929807044612746",
  "dimensions":[{"name":"EntityORGANIZATION","stringVal":"Medina Gazette"}],
  "eventTime":"2019-04-05T08:00:00+00:00"
}

Each event must have an eventTime field carrying the event timestamp and a long-valued groupId marking related events. Event properties are included as dimensions, each of which has a name and exactly one of stringVal, boolVal, longVal, or doubleVal.

NOTE: Google Cloud APIs accept both camel case (like camelCase) and snake case (like snake_case) for JSON field names. This documentation mostly uses camel case.

NOTE: JSON numbers are double-precision floating-point values, so integers keep only 53 bits of precision; both groupId and longVal are therefore effectively limited to 53 bits when passed as JSON numbers. To provide full int64 data, quote the JSON value as a string. A groupId is typically a numerical ID or generated with a deterministic hash function, satisfying the above restriction.

NOTE: The name and stringVal fields should contain only alphanumeric characters and underscores ('_'). Special characters, including spaces, are not supported.
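
For illustration, here is a hypothetical event using each of the value types; the pageCount, sentimentScore, and isBreaking dimensions are made up for this example and are not part of the tutorial dataset. Note that the int64-valued groupId and longVal are quoted as strings, per the note above:

{
  "groupId":"1234567890123456789",
  "dimensions":[
    {"name":"EntityORGANIZATION","stringVal":"Medina Gazette"},
    {"name":"pageCount","longVal":"42"},
    {"name":"sentimentScore","doubleVal":0.75},
    {"name":"isBreaking","boolVal":true}
  ],
  "eventTime":"2019-04-05T08:00:00+00:00"
}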

NOTE: When reading from a static Google Cloud Storage data source, each JSON event must be given as a single line, as follows:

{"groupId":"-6180929807044612746","dimensions":[{"name":"EntityORGANIZATION","stringVal":"Medina Gazette"}],"eventTime":"2019-04-05T08:00:00+00:00"}

List datasets

projects.datasets.list shows all datasets under ${PROJECT_ID}. Note that gcurl is an alias and PROJECT_ID is an environment variable, both set up in Getting Started.
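
If you skipped that setup, a minimal equivalent looks like the following; this sketch assumes you authenticate through gcloud, while Getting Started may use a different credential flow:

$ alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'
$ export PROJECT_ID=your-project-id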

$ gcurl https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets

The result is a JSON string like

{
  "datasets": [
    {
      "name": "example",
      "state": "LOADED",
      ...
    },
    {
      "name": "dataset_tutorial",
      "state": "LOADING",
      ...
    }
  ]
}

The results show the datasets currently under the project. The state field indicates whether a dataset is ready to use. A newly created dataset is in the LOADING state until indexing completes, then transitions to the LOADED state. If any errors occur during creation or indexing, it enters the FAILED state. The result also includes the complete dataset information from the original create request.
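
For a quick name-to-state overview, you can filter the response with jq (assuming it is installed):

$ gcurl https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets | jq -r '.datasets[] | "\(.name): \(.state)"'
example: LOADED
dataset_tutorial: LOADING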

Create dataset

projects.datasets.create adds a new dataset to the project.

$ gcurl -X POST -d @create.json https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets

where create.json contains:

{
  "name": "dataset_tutorial",
  "streaming": true,
  "ttl": "8640000s",
  "dataNames": [
    "EntityCONSUMER_GOOD",
    "EntityEVENT",
    "EntityLOCATION",
    "EntityORGANIZATION",
    "EntityOTHER",
    "EntityPERSON",
    "EntityUNKNOWN",
    "EntityWORK_OF_ART"
  ],
  "dataSources": [
    {"uri": "gs://data.gdeltproject.org/blog/2021-timeseries-insights-api/datasets/webnlp-201904.json"}
  ]
}

This request creates a dataset named dataset_tutorial from the GCS dataSources, which contain Event data in JSON format. Only dimensions listed in dataNames are indexed and used by the system. With streaming=true, the dataset also accepts streaming updates after the initial indexing completes; streaming updates older than ttl (here 8640000s, or 100 days) are ignored, though.
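
Before creating the dataset, you can sanity-check a data source by peeking at its first line; this assumes the gsutil CLI is installed and the bucket is readable. Each line must be a single-line JSON event, as noted earlier:

$ gsutil cat gs://data.gdeltproject.org/blog/2021-timeseries-insights-api/datasets/webnlp-201904.json | head -n 1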

The create request returns success once it is accepted by the API server. The dataset remains in the LOADING state until indexing completes; it then transitions to LOADED and starts accepting queries and, if enabled, streaming updates.
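
One simple way to wait for indexing to finish is to poll the list endpoint until the dataset reports LOADED; this is a minimal sketch, assuming jq is installed:

$ until gcurl -s https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets | jq -e '.datasets[] | select(.name == "dataset_tutorial" and .state == "LOADED")' > /dev/null; do sleep 30; done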

Query dataset

projects.datasets.query performs anomaly detection queries.

$ gcurl -X POST -d @query.json https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets/dataset_tutorial:query

where query.json contains:

{
  "dimensionNames": ["EntityLOCATION"],
  "testedInterval": {
    "startTime": "2019-04-15T00:00:00Z",
    "length": "86400s"
  },
  "forecastParams": {
    "holdout": 10,
    "minDensity": 0,
    "forecastHistory": "1209600s",
    "maxPositiveRelativeChange": 1,
    "maxNegativeRelativeChange": 1,
    "forecastExtraWeight": 0,
    "seasonalityHint": "DAILY"
  },
  "returnNonAnomalies": true
}

This query detects whether any anomalies occurred during testedInterval for slices split across the dimensions given by dimensionNames. A slice is a subset of the events in the dataset with fixed values for some of their dimensions; for example, {"name": "EntityLOCATION", "stringVal": "Seine River"} is a slice. Any subset of the dataNames from the dataset definition can be used as dimensionNames; the API aggregates events over the unmentioned dimensions. This is similar to a "group by" operation with "count(*)" in SQL queries.

Events in each slice are aggregated based on forecastParams.aggregatedDimension. If this field is empty, all events in the slice are simply counted. If it is not empty, it must name a valid numerical dimension present in the slice's events; the numerical values are then summed together to form the time series.
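
The tutorial dataset carries only string dimensions, so the following query sketch is purely illustrative: it assumes a hypothetical longVal dimension named pageCount, whose values would be summed per time bucket instead of counting events:

{
  "dimensionNames": ["EntityLOCATION"],
  "testedInterval": {
    "startTime": "2019-04-15T00:00:00Z",
    "length": "86400s"
  },
  "forecastParams": {
    "aggregatedDimension": "pageCount",
    "holdout": 10,
    "forecastHistory": "1209600s",
    "seasonalityHint": "DAILY"
  }
}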

Each slice is analyzed for anomalies as follows:

  1. A time series is formed spanning from testedInterval.startTime - forecastParams.forecastHistory to testedInterval.startTime + testedInterval.length, in which each data point is a time bucket of length testedInterval.length whose value is given by aggregating the events in that bucket as described above. If the time series does not have enough data points (as specified by the minDensity parameter), analysis of the slice stops. See the worked example after this list.
  2. Once the time series for the slice is computed, it is analyzed using common forecasting techniques. The first (100 - holdout)% of the time series trains a prediction model, and the last holdout% tests the model's quality. Based on the error metrics, confidence bounds are computed for the tested interval; if the actual value falls outside these bounds by more than the configured amount, the slice is marked as an anomaly. How far outside the bounds the actual value may be is controlled by the maxPositiveRelativeChange, maxNegativeRelativeChange, and forecastExtraWeight parameters.
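
As a concrete example using query.json above: forecastHistory is 1209600s (14 days) and testedInterval.length is 86400s (1 day), so each slice's time series consists of 15 daily buckets, the 14 days of history plus the tested day. With holdout set to 10, roughly the first 90% of those points train the model and the last 10% are held out to compute holdoutErrors.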

The query result looks like the following:

{
  "name": "projects/timeseries-staging/datasets/dataset_tutorial",
  "anomalyDetectionResult": {
    "anomalies": [
      {
        "dimensions": [
          {
            "name": "EntityLOCATION",
            "stringVal": "Ile de la Cite"
          }
        ],
        "result": {
          "holdoutErrors": {},
          "trainingErrors": {
            "mdape": 1,
            "rmd": 1
          },
          "forecastStats": {
            "density": "23",
            "numAnomalies": 1
          },
          "testedIntervalActual": 440,
          "testedIntervalForecastLowerBound": -1,
          "testedIntervalForecastUpperBound": 1
        },
        "status": {}
      },
      {
        "dimensions": [
          {
            "name": "EntityLOCATION",
            "stringVal": "Seine"
          }
        ],
        "result": {
          "holdoutErrors": {
            "mdape": 0.1428571428571429,
            "rmd": 0.1428571428571429
          },
          "trainingErrors": {
            "mdape": 0.84615384615384626,
            "rmd": 0.62459546925566334
          },
          "forecastStats": {
            "density": "85",
            "numAnomalies": 1
          },
          "testedIntervalActual": 586,
          "testedIntervalForecast": 9.3333333333333339,
          "testedIntervalForecastLowerBound": 8,
          "testedIntervalForecastUpperBound": 10.666666666666668
        },
        "status": {}
      },
      {
        "dimensions": [
          {
            "name": "EntityLOCATION",
            "stringVal": "Notre Dame"
          }
        ],
        "result": {
          "holdoutErrors": {
            "mdape": 0.42857142857142855,
            "rmd": 0.42857142857142855
          },
          "trainingErrors": {
            "mdape": 0.19999999999999996,
            "rmd": 0.65055762081784374
          },
          "forecastStats": {
            "density": "100",
            "numAnomalies": 1
          },
          "testedIntervalActual": 790,
          "testedIntervalForecast": 7,
          "testedIntervalForecastLowerBound": 4,
          "testedIntervalForecastUpperBound": 10
        },
        "status": {}
      },
      ...
    ],
    "nonAnomalies": [
      ...
    ]
  }
}

The response contains the detected anomalies and, since returnNonAnomalies is true, evaluated slices that were not marked as anomalies, both in the same ForecastSlice format. For each slice, result shows the actual value over the tested interval and the range of forecast values, while trainingErrors and holdoutErrors report the additional error metrics used for anomaly detection.
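
To pull out just the anomalous slices from the response, you can filter it with jq (assuming it is installed):

$ gcurl -X POST -d @query.json https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets/dataset_tutorial:query | jq -r '.anomalyDetectionResult.anomalies[].dimensions[] | "\(.name)=\(.stringVal)"'
EntityLOCATION=Ile de la Cite
EntityLOCATION=Seine
EntityLOCATION=Notre Dame
...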

Streaming update

projects.datasets.appendEvents adds Event records in a streaming fashion if the create request specifies streaming: true.

$ gcurl -X POST -d @append.json https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets/dataset_tutorial:appendEvents

where append.json contains:

{
  "events": [
    {
      "groupId":"-5379487492185488040",
      "dimensions":[{"name":"EntityPERSON","stringVal":"Jason Marsalis"}],
      "eventTime":"2021-06-01T15:45:00+00:00"
    },{
      "groupId":"1324354349507023708",
      "dimensions":[{"name":"EntityORGANIZATION","stringVal":"WAFA"}],
      "eventTime":"2021-06-02T04:00:00+00:00"
    }
  ]
}

Streamed updates are indexed in near real time, so changes show up quickly in query results.
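
Because updates older than ttl are ignored, streamed events should carry recent timestamps. Here is one way to generate a payload stamped with the current time; the organization name and groupId below are arbitrary illustrations:

$ cat > append.json <<EOF
{
  "events": [
    {
      "groupId": "$(date +%s)",
      "dimensions": [{"name": "EntityORGANIZATION", "stringVal": "ExampleOrg"}],
      "eventTime": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    }
  ]
}
EOF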

Delete dataset

projects.datasets.delete marks the dataset for deletion.

$ gcurl -X DELETE https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets/dataset_tutorial

The request returns immediately, after which the dataset no longer accepts queries or updates. It may take some time before the data is completely removed from the service; after that, List datasets will no longer return this dataset.
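
To confirm removal, list the datasets again; once deletion completes, dataset_tutorial no longer appears in the response:

$ gcurl https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets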

What's next

Some other examples can be found on the GDELT website.