Evaluate a model

This page is valid for the engine versions within the following major engine version groupings. To view the page for other engine versions, use the selector at the top of this page.

In summary, here are the changes from engine versions v003 to v004 (up to v004.008):

Added ObservedRecallValuesPerTypology metric to the Backtest output.
Added partiesCount and identifiedPartiesCount to the ObservedRecallValues metric value.

Overview

Backtest results provide you with a summary of model performance in a specified timeframe. They are generated by predicting on all customers within a backtest period and evaluating model performance against available risk events.

Backtest results can be used to measure model performance on a separate time range from that used in training, or also over time to check for performance degradation.

How to backtest

To create a BacktestResult resource, see Create and manage backtest results.

In particular, you need to select the following:

The data to use for backtesting:

Specify a dataset and an end time within the date range of the dataset.

Training uses labels and features based on complete calendar months up to, but not including, the month of the selected end time. For more information, see Dataset time ranges.

Specify how many months of labeled data to use for backtesting (that is, the number of backtest periods).

Specify the number of parties to evaluate as part of the test using the PerformanceTarget field.
The volume of investigations expected based on the models:

Specify partyInvestigationsPerPeriodHint. Backtesting evaluates the AML AI model at a range of monthly investigation volumes, based on the amount you specify. For more information, see Backtest output.
A model created using a consistent dataset:

See Create a model.

Backtest periods

The backtestPeriods field specifies how many consecutive calendar months to use features and labels from in the performance evaluation of this model.

The following apply to backtest data:

The months used in evaluation are the most recent complete calendar months prior to the specified endTime. For example, if endTime is 2023-04-15T23:21:00Z and backtestPeriods is 5, then the labels from the following months are used: 2023-03, 2023-02, 2023-01, 2022-12, and 2022-11.
You should use the most recent available data for backtesting when evaluating a model in preparation for production use.
Backtest periods must be set to 3 or greater. Two months of the backtest period are reserved to account for repeat alerts, and the remaining months are used to generate positive labels for performance evaluation.

Note: Use five or more backtest periods, depending on the accuracy of the evaluation you're seeking. Using five backtest periods means that three periods are used for generating positive labels.
Avoid using overlapping months for training and backtesting as this risks overfitting. Make sure backtest and training end times are at least backtestPeriods apart. That is,

(backtest results end time month) >= (model end time month) + backtestPeriods

Optionally, you can also create prediction results for a model and conduct your own party-level analyses of model performance.

Backtest output

The backtest results metadata contains the following metrics. In particular, these metrics show you the following:

How the model performs compared to labels from a separate time period and for a variety of different investigation volumes or risk score thresholds

Note: Recall metrics only show the share of historical AML_EXIT events that would be alerted with a given model, and can be used to decide to proceed to live testing. Given that not all customers alerted by the model have previously been investigated, real-life performance is expected to be significantly better.
Measurements which can be used to assess dataset consistency (for example, by comparing the missingness values of feature families from different operations)

Metric name	Metric description	Example metric value
ObservedRecallValues	Recall metric measured on the dataset specified for backtesting. The API includes 20 of these measurements, at different operating points, evenly distributed from 0 (not included) until 2 * `partyInvestigationsPerPeriodHint`. The API adds a final recall measurement at `partyInvestigationsPerPeriodHint`. Alongside the recall value, we also provide the numerator and denominator as `partiesCount` and `identifiedPartiesCount` respectively.	{ "recallValues": [ { "partyInvestigationsPerPeriod": 5000, "recallValue": 0.80, "partiesCount": 60, "identifiedPartiesCount": 48, "scoreThreshold": 0.42, }, ... ... { "partyInvestigationsPerPeriod": 8000, "recallValue": 0.85, "partiesCount": 60, "identifiedPartiesCount": 51, "scoreThreshold": 0.30, }, ], }
ObservedRecallValuesPerTypology	Recall metric on a risk typology level measured on the dataset specified for backtesting. The measurements follow the same approach as `ObservedRecallValues`.	{ "recallValuesPerTypology": [ { "partyInvestigationsPerPeriod": 5000, "riskTypology": "risk_typology_id_1", "recallValue": 0.80, "partiesCount": 60, "identifiedPartiesCount": 48, "scoreThreshold": 0.42, }, { "partyInvestigationsPerPeriod": 8000, "riskTypology": "risk_typology_id_1", "recallValue": 0.90, "partiesCount": 60, "identifiedPartiesCount": 54, "scoreThreshold": 0.30, }, ... ... { "partyInvestigationsPerPeriod": 8000, "riskTypology": "risk_typology_id_2", "recallValue": 0.75, "partiesCount": 4 "identifiedPartiesCount": 3, "scoreThreshold": 0.30, }, ], }
Missingness	Share of missing values across all features in each feature family. Ideally, all AML AI feature families should have a Missingness near to 0. Exceptions may occur where the data underlying those feature families is unavailable for integration. A significant change in this value for any feature family between tuning, training, evaluation, and prediction can indicate inconsistency in the datasets used.	{ "featureFamilies": [ { "featureFamily": "unusual_wire_credit_activity", "missingnessValue": 0.00, }, ... ... { "featureFamily": "party_supplementary_data_id_3", "missingnessValue": 0.45, }, ], }
Skew	Metrics showing skew between training and prediction or backtest datasets. Family skew indicates changes in the distribution of feature values within a feature family, weighted by importance of the feature within that family. Max skew indicates the maximum skew of any feature within that family. Skew values range from 0, representing no significant change in the distribution of values of features in the family, to 1 for the most significant change. A large value for either family skew or max skew indicates a significant change in the structure of your data in a way that may impact model performance. Family skew takes the value -1 when no features in the family are used by the model. For large skew values, you should do one of the following: Investigate changes in the data used by that feature family (see model governance support materials) and fix any input data issues Retrain a model on more recent data You should set thresholds for acting on family and max skew values based on observing the natural variation in skew metrics over several months.	{ "featureFamilies": [ { "featureFamily": "unusual_wire_credit_activity", "familySkewValue": 0.10, "maxSkewValue": 0.14, }, ... ... { "featureFamily": "party_supplementary_data_id_3", "familySkewValue": 0.11, "maxSkewValue": 0.11, }, ], }