Tutorial

We use a small dataset provided by Kalev Leetaru to illustrate the Timeseries Insights API. The dataset is derived from The GDELT Project, a global database tracking world events and media coverage. It contains entity mentions in news URLs from April 2019.

Objectives

  • Learn the data format for Timeseries Insights API.
  • Learn how to create, query, update and delete datasets.

Costs

There is no cost for Preview.

Before you begin

Set up a Cloud project and enable the Timeseries Insights API by following Getting Started.

Tutorial dataset

The dataset includes entity annotations of locations, organizations, persons, among others.

The Timeseries Insights API takes JSON input. A sample Event for this dataset is:

{
  "groupId":"-6180929807044612746",
  "dimensions":[{"name":"EntityORGANIZATION","stringVal":"Medina Gazette"}],
  "eventTime":"2019-04-05T08:00:00+00:00"
}

Each event must have an eventTime field for the event timestamp and a long-valued groupId to mark related events. Event properties are included as dimensions, each of which has a name and one of stringVal, boolVal, longVal, or doubleVal.
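As a minimal sketch of the structure described above, the following Python builds one Event as a dictionary and serializes it. The helper name make_event is hypothetical and not part of the API, and it only covers string-valued dimensions.

```python
import json


def make_event(group_id, event_time, dimensions):
    """Build an Event dict; `dimensions` maps a dimension name to a string value."""
    return {
        # The ID is sent as a quoted string; see the NOTE on int64 precision.
        "groupId": str(group_id),
        "dimensions": [
            {"name": name, "stringVal": value}
            for name, value in dimensions.items()
        ],
        "eventTime": event_time,
    }


event = make_event(
    -6180929807044612746,
    "2019-04-05T08:00:00+00:00",
    {"EntityORGANIZATION": "Medina Gazette"},
)
print(json.dumps(event, indent=2))
```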

NOTE: Google Cloud APIs accept both camel case (like camelCase) and snake case (like snake_case) for JSON field names. The documentation mostly uses camel case.

NOTE: JSON numbers are in practice handled as floating-point values, which hold integers exactly only up to 53 binary digits, so both groupId and longVal are effectively limited to 53 bits if they are written as JSON numbers. To provide full int64 data, quote the JSON value as a string. A groupId is typically a numerical ID or is generated with a deterministic hash function, satisfying the above restriction.
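The precision limit can be seen directly in Python, whose float is a 64-bit double. The sketch below also shows one possible way to derive a deterministic int64 groupId from a key such as a URL. The function group_id_from_key and the choice of SHA-256 are assumptions made for illustration; the API does not prescribe a hash function.

```python
import hashlib
import json

# 2**53 + 1 is the smallest positive integer a 64-bit float cannot hold exactly.
big = 2**53 + 1
rounded = int(float(big))
print(big, rounded)  # the two values differ by one

# Quoted as a string, the same value survives a JSON round trip unchanged.
restored = json.loads(json.dumps({"groupId": str(big)}))["groupId"]


def group_id_from_key(key):
    """Hypothetical helper: hash a key to a signed 64-bit ID, returned as a string."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as a signed 64-bit integer.
    return str(int.from_bytes(digest[:8], byteorder="big", signed=True))


print(group_id_from_key("https://example.com/some-article"))
```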

NOTE: The name and stringVal fields are supposed to contain only alphanumeric characters and '_'. Special characters, including the space, are not supported.

NOTE: When reading from a static Google Cloud Storage data source, each JSON event must be on a single line, as follows:

{"groupId":"-6180929807044612746","dimensions":[{"name":"EntityORGANIZATION","stringVal":"Medina Gazette"}],"eventTime":"2019-04-05T08:00:00+00:00"}
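One way to produce this one-event-per-line layout is to serialize each event with compact separators and join the results with newlines. This is a sketch only; the function name to_json_lines is hypothetical.

```python
import json


def to_json_lines(events):
    """Serialize each event on its own line, with no extra whitespace."""
    lines = [json.dumps(event, separators=(",", ":")) for event in events]
    return "\n".join(lines) + "\n"


events = [
    {
        "groupId": "-6180929807044612746",
        "dimensions": [{"name": "EntityORGANIZATION", "stringVal": "Medina Gazette"}],
        "eventTime": "2019-04-05T08:00:00+00:00",
    },
]
print(to_json_lines(events), end="")
```

json.dumps escapes any newline inside a string value, so each event stays on one line.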

List datasets

projects.datasets.list shows all datasets under ${PROJECT_ID}. Note that gcurl is an alias and PROJECT_ID is an environment variable; both are set up in Getting Started.

$ gcurl https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets

The result is a JSON string like

{
  "datasets": [
    {
      "name": "example",
      "state": "LOADED",
      ...
    },
    {
      "name": "dataset_tutorial",
      "state": "LOADING",
      ...
    }
  ]
}

The results show the datasets currently under the project. The state field indicates whether the dataset is ready to be used. A newly created dataset is in the LOADING state until indexing completes, then transitions to the LOADED state. If any errors occur during creation or indexing, it will be in the FAILED state. The result also includes the complete dataset information from the original create request.
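To check which datasets are ready, the list response can be grouped by its state field. The sketch below works on a trimmed copy of the sample response rather than calling the API; the function name datasets_by_state and the "(no state)" placeholder are assumptions for illustration.

```python
import json

# Trimmed copy of the sample list response shown above.
response_text = """
{
  "datasets": [
    {"name": "example", "state": "LOADED"},
    {"name": "dataset_tutorial", "state": "LOADING"}
  ]
}
"""


def datasets_by_state(text):
    """Map each state to the names of the datasets currently in that state."""
    states = {}
    for dataset in json.loads(text).get("datasets", []):
        # "(no state)" is a placeholder used here only if the field is absent.
        state = dataset.get("state", "(no state)")
        states.setdefault(state, []).append(dataset["name"])
    return states


by_state = datasets_by_state(response_text)
print("Ready for queries:", by_state.get("LOADED", []))
```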

Create dataset

projects.datasets.create adds a new dataset to the project.

$ gcurl -X POST -d @create.json https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets

where create.json contains:

{
  "name": "dataset_tutorial",
  "streaming": true,
  "ttl": "8640000s",
  "dataNames": [
    "EntityCONSUMER_GOOD",
    "EntityEVENT",
    "EntityLOCATION",
    "EntityORGANIZATION",
    "EntityOTHER",
    "EntityPERSON",
    "EntityUNKNOWN",
    "EntityWORK_OF_ART"
  ],
  "dataSources": [
    {"uri": "gs://data.gdeltproject.org/blog/2021-timeseries-insights-api/datasets/webnlp-201904.json"}
  ]
}

This request creates a dataset named dataset_tutorial from the GCS dataSources, which contain Event data in JSON format. Only dimensions listed in dataNames are indexed and used by the system. If streaming=true, the dataset also accepts streaming updates after the initial indexing completes. Streaming updates older than ttl are ignored, however.

The create request returns success if it is accepted by the API server. The dataset stays in the LOADING state until indexing completes; it then moves to the LOADED state and starts accepting queries and, if streaming is enabled, updates.
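The request body can also be generated programmatically, which keeps the JSON valid and makes the ttl value easier to read. The sketch below assumes, as the sample suggests, that durations are written as a number of seconds followed by "s"; duration_string is a hypothetical helper.

```python
import json
from datetime import timedelta


def duration_string(delta):
    """Format a timedelta as whole seconds followed by 's', as in the sample."""
    return f"{int(delta.total_seconds())}s"


create_request = {
    "name": "dataset_tutorial",
    "streaming": True,
    "ttl": duration_string(timedelta(days=100)),  # 100 days, as in the sample
    "dataNames": [
        "EntityCONSUMER_GOOD",
        "EntityEVENT",
        "EntityLOCATION",
        "EntityORGANIZATION",
        "EntityOTHER",
        "EntityPERSON",
        "EntityUNKNOWN",
        "EntityWORK_OF_ART",
    ],
    "dataSources": [
        {"uri": "gs://data.gdeltproject.org/blog/2021-timeseries-insights-api/datasets/webnlp-201904.json"}
    ],
}

body = json.dumps(create_request, indent=2)
print(body)  # this text can be saved as create.json
```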

Query dataset

projects.datasets.query performs anomaly detection queries.

$ gcurl -X POST -d @query.json https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets/dataset_tutorial:query

where query.json contains:

{
  "detectionTime": "2019-04-15T00:00:00Z",
  "slicingParams": {
    "dimensionNames": ["EntityLOCATION"]
  },
  "timeseriesParams": {
    "forecastHistory": "1209600s",
    "granularity": "86400s"
  },
  "forecastParams": {
    "sensitivity": 0.1,
    "noiseThreshold": 100.0
  },
  "returnNonAnomalies": true
}
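The duration strings in this request are easier to read once converted: 1209600s is 14 days and 86400s is one day. The sketch below works this out and derives the start of the history window, under the assumption that the window ends at detectionTime; parse_duration is a hypothetical helper.

```python
from datetime import datetime, timedelta


def parse_duration(text):
    """Parse a duration such as '86400s' into a timedelta."""
    if not text.endswith("s"):
        raise ValueError(f"expected a value ending in 's', got {text!r}")
    return timedelta(seconds=float(text[:-1]))


# "+00:00" is used instead of "Z" so this parses on older Python versions too.
detection_time = datetime.fromisoformat("2019-04-15T00:00:00+00:00")
forecast_history = parse_duration("1209600s")
granularity = parse_duration("86400s")

# Assumption: the history used for forecasting ends at detectionTime.
history_start = detection_time - forecast_history
history_points = forecast_history // granularity

print(history_start.isoformat())  # start of the 14-day history window
print(history_points)             # number of daily points in that window
```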

The query result looks like the following:

{
  "name": "projects/timeseries-staging/datasets/dataset_tutorial",
  "anomalyDetectionResult": {
    "anomalies": [
      {
        "dimensions": [
          {
            "name": "EntityLOCATION",
            "stringVal": "Ile de la Cite"
          }
        ],
        "result": {
          "holdoutErrors": {},
          "trainingErrors": {
            "mdape": 1,
            "rmd": 1
          },
          "forecastStats": {
            "density": "23",
            "numAnomalies": 1
          },
          "detectionPointActual": 440,
          "detectionPointForecastLowerBound": -1,
          "detectionPointForecastUpperBound": 1
        },
        "status": {}
      },
      {
        "dimensions": [
          {
            "name": "EntityLOCATION",
            "stringVal": "Seine"
          }
        ],
        "result": {
          "holdoutErrors": {
            "mdape": 0.1428571428571429,
            "rmd": 0.1428571428571429
          },
          "trainingErrors": {
            "mdape": 0.84615384615384626,
            "rmd": 0.62459546925566334
          },
          "forecastStats": {
            "density": "85",
            "numAnomalies": 1
          },
          "detectionPointActual": 586,
          "detectionPointForecast": 9.3333333333333339,
          "detectionPointForecastLowerBound": 8,
          "detectionPointForecastUpperBound": 10.666666666666668
        },
        "status": {}
      },
      {
        "dimensions": [
          {
            "name": "EntityLOCATION",
            "stringVal": "Notre Dame"
          }
        ],
        "result": {
          "holdoutErrors": {
            "mdape": 0.42857142857142855,
            "rmd": 0.42857142857142855
          },
          "trainingErrors": {
            "mdape": 0.19999999999999996,
            "rmd": 0.65055762081784374
          },
          "forecastStats": {
            "density": "100",
            "numAnomalies": 1
          },
          "detectionPointActual": 790,
          "detectionPointForecast": 7,
          "detectionPointForecastLowerBound": 4,
          "detectionPointForecastUpperBound": 10
        },
        "status": {}
      },
      ...
    ],
    "nonAnomalies": [
      ...
    ]
  }
}
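A response like this can be summarized by comparing each slice's actual value with the upper bound of its forecast. The sketch below uses a trimmed copy of the three anomalies shown above and ranks them by that excess; the ranking criterion and the function name rank_anomalies are illustrative choices, not something the API defines.

```python
# Trimmed copy of the sample response, keeping only the fields used below.
response = {
    "anomalyDetectionResult": {
        "anomalies": [
            {"dimensions": [{"name": "EntityLOCATION", "stringVal": "Ile de la Cite"}],
             "result": {"detectionPointActual": 440,
                        "detectionPointForecastUpperBound": 1}},
            {"dimensions": [{"name": "EntityLOCATION", "stringVal": "Seine"}],
             "result": {"detectionPointActual": 586,
                        "detectionPointForecastUpperBound": 10.666666666666668}},
            {"dimensions": [{"name": "EntityLOCATION", "stringVal": "Notre Dame"}],
             "result": {"detectionPointActual": 790,
                        "detectionPointForecastUpperBound": 10}},
        ]
    }
}


def rank_anomalies(resp):
    """Return (label, actual, excess over upper bound), largest excess first."""
    rows = []
    for anomaly in resp["anomalyDetectionResult"].get("anomalies", []):
        result = anomaly["result"]
        # Only string-valued dimensions are labeled in this sketch.
        label = ", ".join(
            f'{d["name"]}={d.get("stringVal")}' for d in anomaly["dimensions"]
        )
        actual = result["detectionPointActual"]
        excess = actual - result["detectionPointForecastUpperBound"]
        rows.append((label, actual, excess))
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranked = rank_anomalies(response)
for label, actual, excess in ranked:
    print(f"{label}: actual={actual}, above upper bound by {excess:.1f}")
```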

Streaming update

projects.datasets.appendEvents adds Event records in a streaming fashion if the create request specifies streaming: true.

$ gcurl -X POST -d @append.json https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets/dataset_tutorial:appendEvents

where append.json contains:

{
  "events": [
    {
      "groupId":"1324354349507023708",
      "dimensions":[{"name":"EntityPERSON","stringVal":"Jason Marsalis"}],
      "eventTime":"2021-06-01T15:45:00+00:00"
    },{
      "groupId":"1324354349507023708",
      "dimensions":[{"name":"EntityORGANIZATION","stringVal":"WAFA"}],
      "eventTime":"2021-06-02T04:00:00+00:00"
    }
  ]
}

Streamed updates are indexed in near real time, so changes show up quickly in query results. All events sent in a single projects.datasets.appendEvents request must have the same groupId.
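Because each request may carry only one groupId, a mixed batch of events has to be split before sending. A minimal sketch of that grouping step follows; append_requests is a hypothetical helper, the third event is made-up sample data, and actually sending the requests is left out.

```python
from collections import defaultdict


def append_requests(events):
    """Split events into one appendEvents request body per groupId."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["groupId"]].append(event)
    return [{"events": batch} for batch in grouped.values()]


events = [
    {"groupId": "1324354349507023708",
     "dimensions": [{"name": "EntityPERSON", "stringVal": "Jason Marsalis"}],
     "eventTime": "2021-06-01T15:45:00+00:00"},
    {"groupId": "1324354349507023708",
     "dimensions": [{"name": "EntityORGANIZATION", "stringVal": "WAFA"}],
     "eventTime": "2021-06-02T04:00:00+00:00"},
    # Made-up event with a different groupId, to show the split.
    {"groupId": "42",
     "dimensions": [{"name": "EntityLOCATION", "stringVal": "Seine"}],
     "eventTime": "2021-06-02T05:00:00+00:00"},
]

requests_to_send = append_requests(events)
for body in requests_to_send:
    print(body["events"][0]["groupId"], len(body["events"]))
```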

Delete dataset

projects.datasets.delete marks the dataset for deletion.

$ gcurl -X DELETE https://timeseriesinsights.googleapis.com/v1/projects/${PROJECT_ID}/datasets/dataset_tutorial

The request returns immediately, and the dataset will not accept additional queries or updates. It may take some time before the data is completely removed from the service, after which List datasets will no longer return this dataset.
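Since removal is asynchronous, a script that needs to reuse the dataset name may want to poll the list call until the dataset is gone. The sketch below shows only the polling loop: wait_until_gone and its parameters are hypothetical, and the function that fetches the current dataset names is passed in by the caller, so no API call is made here.

```python
import time


def wait_until_gone(list_dataset_names, name, interval_s=5.0, max_checks=60,
                    sleep=time.sleep):
    """Poll until `name` is absent from the listed datasets.

    Returns True once the dataset is gone, or False after `max_checks` tries.
    """
    for _ in range(max_checks):
        if name not in list_dataset_names():
            return True
        sleep(interval_s)
    return False


# Demonstration with canned list results instead of real API responses.
fake_results = iter([
    ["example", "dataset_tutorial"],
    ["example", "dataset_tutorial"],
    ["example"],
])
gone = wait_until_gone(lambda: next(fake_results), "dataset_tutorial",
                       sleep=lambda seconds: None)
print(gone)
```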

What's next

Some other examples can be found on the GDELT website.