Scoring guide

The Cloud Inference API returns a set of distributions in response to queries. Each entry in a distribution is an event with a score. The core of Inference scoring is conditional probability: the probability of an event occurring within your dataset given that the query also occurs. These scores can be summarized by the expression:

\[ \frac{P(event \mid query)}{P(event)^{exp}} \]

Background probability exponent

The \( exp \) in this expression is bgprobExp, a key parameter that controls how strongly the background probability is incorporated into the score. The background probability is the probability of the event occurring in a random group in the dataset, irrespective of whether the query occurs.

When bgprobExp is 0, the raw conditional probability \( P(event \mid query) \) is returned. When bgprobExp is 1, a pure ratio is returned, \( \frac{P(event \mid query)}{P(event)} \), called a lift score. The lift score describes how much more or less likely the event is when the query occurs, compared to its baseline probability.

The default value of bgprobExp, 0.7, is a blend between these two extremes. This tunes the scores to return events that are unusual in the context of your dataset, but still gives some scoring weight to event popularity.
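The effect of the exponent can be sketched in a few lines of Python. This is only an illustration of the expression above, not the API's implementation: the function name `inference_score` and the probabilities for the two events are hypothetical values chosen for the example.

```python
def inference_score(p_event_given_query: float,
                    p_event: float,
                    bgprob_exp: float = 0.7) -> float:
    """Evaluate P(event | query) / P(event) ** bgprob_exp."""
    if not 0.0 < p_event <= 1.0:
        raise ValueError("p_event must be in (0, 1]")
    return p_event_given_query / p_event ** bgprob_exp


# Hypothetical probabilities: one popular event and one rare event.
common = {"p_event_given_query": 0.13, "p_event": 0.02}
rare = {"p_event_given_query": 0.019, "p_event": 0.0003}

for exp in (0.0, 0.7, 1.0):
    print(
        f"bgprobExp={exp}: "
        f"common={inference_score(bgprob_exp=exp, **common):.3f} "
        f"rare={inference_score(bgprob_exp=exp, **rare):.3f}"
    )
```

With an exponent of 0 the popular event ranks first; as the exponent grows toward 1, the rare event overtakes it because its background probability is so much smaller.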

Example from GDelt

The gdelt_2018_04_data example from the quickstart illustrates how bgprobExp can reveal different aspects of a dataset. Try a request with the following compound query, which sets bgprobExp to 0.0. The query selects article groups where the tagged news images show joyful faces and the text originates in the United Kingdom.

{
  "name": "gdelt_2018_04_data",
  "queries": [{
    "query": {
      "type": "TYPE_AND",
      "children": [{
        "type": "TYPE_TERM",
        "term": {
          "name": "ImageFaceToneHas",
          "value": "Joy"
        }
      },{
        "type": "TYPE_TERM",
        "term": {
          "name": "PageTextGeo",
          "value": "United Kingdom"
        }
      }]
    },
    "distribution_configs": {
      "data_name": "ImageWebEntity",
      "bgprobExp": 0.0,
      "max_result_entries": 5
    }
  }]
}

Because bgprobExp is set to zero, the returned scores are pure conditional probabilities that do not take into account the background popularity of the returned terms. This gives an accurate but generic view of the articles matching your query.

The top entry in the returned results is a fairly generic label, with a very high group count:

            {
              "value": "ImageWebEntity=Socialite",
              "score": 0.13140087,
              "matchedGroupCount": "7899",
              "totalGroupCount": "123396"
            },

The distribution as a whole has a matchedGroupCount of 59079, the number of groups that match the query. Of those, 7899 contain this entry, resulting in an event score of ~0.13.
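The arithmetic behind that score can be checked directly. The snippet below assumes that 59079 is the number of groups matching the query and that the entry's matchedGroupCount of 7899 is the number of those groups containing the event; the variable names are illustrative only.

```python
# Counts taken from the example response above.
groups_matching_query = 59079      # matchedGroupCount of the distribution
groups_with_event_and_query = 7899 # matchedGroupCount of the entry

# Raw estimate of P(event | query) from group counts alone.
raw_estimate = groups_with_event_and_query / groups_matching_query
print(f"{raw_estimate:.4f}")  # prints 0.1337
```

The returned score of 0.13140087 is slightly below this raw ratio, which is what you would expect if the API reports a conservative estimate rather than the direct ratio.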

Run the query again, but set the bgprobExp to the default value of 0.7. The results now take into account the background probability of returned events: \( \frac{P(event \mid query)}{P(event)^{0.7}} \). Unusual events will have relatively higher scores.
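To compare several exponent values without editing the JSON by hand, the request body can be generated programmatically. The sketch below only builds the body shown earlier; the helper name `build_request` is hypothetical, and sending the request is left out because it depends on your project and credentials.

```python
import json


def build_request(bgprob_exp: float) -> dict:
    """Return the example GDelt request body with the given bgprobExp."""
    return {
        "name": "gdelt_2018_04_data",
        "queries": [{
            "query": {
                "type": "TYPE_AND",
                "children": [
                    {"type": "TYPE_TERM",
                     "term": {"name": "ImageFaceToneHas", "value": "Joy"}},
                    {"type": "TYPE_TERM",
                     "term": {"name": "PageTextGeo", "value": "United Kingdom"}},
                ],
            },
            "distribution_configs": {
                "data_name": "ImageWebEntity",
                "bgprobExp": bgprob_exp,
                "max_result_entries": 5,
            },
        }],
    }


# Same query as before, now with the default exponent of 0.7.
print(json.dumps(build_request(0.7), indent=2))
```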

The top entry is now a rarer event with much higher relevance to the query (joyful face && United Kingdom). The score now resembles a "lift" ratio more closely than a pure conditional probability.

            {
              "value": "ImageWebEntity=Catherine_Duchess_of_Cambridge",
              "score": 5.0441356,
              "matchedGroupCount": "1133",
              "totalGroupCount": "2478"
            },

Probability of rare terms

The Inference API may return conditional probabilities \( P(event \mid query) \) that are lower than the raw group counts in your data would suggest. The Inference API is designed to avoid returning very rare and possibly noisy events, so instead of the direct probability estimate, it returns the lower bound of a 90% confidence interval. For rare events this may be substantially lower than the estimate based on group count alone.
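To get a feel for how much a lower confidence bound can discount a rare event, here is a minimal sketch. It rests on two assumptions that this guide does not confirm: that the interval is two-sided, and that it is a Wilson score interval. The API's actual method is not specified here, so treat `wilson_lower_bound` as a stand-in rather than a reproduction of the service's behavior.

```python
import math


def wilson_lower_bound(matched: int, total: int, z: float = 1.6449) -> float:
    """Lower bound of a Wilson score interval for matched/total.

    z = 1.6449 corresponds to a two-sided 90% interval (an assumption).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = matched / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return max(0.0, (centre - margin) / (1 + z2 / total))


# Same observed ratio (0.15), very different amounts of evidence.
for matched, total in [(3, 20), (300, 2000), (30000, 200000)]:
    raw = matched / total
    lower = wilson_lower_bound(matched, total)
    print(f"{matched}/{total}: raw={raw:.4f} lower bound={lower:.4f}")
```

With only a handful of matching groups the bound falls far below the raw ratio, while with many groups the two nearly coincide.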

Timespan parameters

By default, the Inference API considers \( P(event \mid query) \) in terms of whole groups: if a group matches the query, the entire set of events in the group is considered to co-occur with the query. Setting the timespan parameters max_before_timespan and max_after_timespan restricts aggregation to a more specific set of events.

Within a group, each event that matches the query is considered a "hit." If the timespan parameters are specified, aggregation will occur only within the time limits specified in the parameters. This allows, for example, aggregation only after an event, only before, or only within a finite time limit in either direction.
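The windowing described above can be pictured with a small sketch. It models only the semantics as stated in this section and is not the API's implementation: the function `events_near_hits`, the tuple representation of events, and the convention that `None` means "no limit" while a zero duration excludes that side are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Event = Tuple[str, datetime]  # (event name, timestamp)


def events_near_hits(events: List[Event],
                     hit_times: List[datetime],
                     max_before: Optional[timedelta] = None,
                     max_after: Optional[timedelta] = None) -> List[Event]:
    """Keep events that fall within the allowed window around any hit."""
    selected = []
    for name, ts in events:
        for hit in hit_times:
            delta = ts - hit  # negative if the event precedes the hit
            before_ok = max_before is None or delta >= -max_before
            after_ok = max_after is None or delta <= max_after
            if before_ok and after_ok:
                selected.append((name, ts))
                break  # count each event once, even with several hits
    return selected


t0 = datetime(2018, 4, 1, 12, 0)
group = [
    ("A", t0 - timedelta(hours=2)),
    ("B", t0),
    ("C", t0 + timedelta(hours=1)),
    ("D", t0 + timedelta(hours=3)),
]

# No limits: the whole group is aggregated.
print([n for n, _ in events_near_hits(group, [t0])])
# Only events at or after the hit, up to two hours later.
print([n for n, _ in events_near_hits(group, [t0],
                                      max_before=timedelta(0),
                                      max_after=timedelta(hours=2))])
```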