Backtest results provide you with a summary of model performance in a specified timeframe. You can use them to measure model performance on a separate time range from the one used in training, or to monitor performance over time and check for degradation.
How to backtest
To create a BacktestResult resource, see Create and manage backtest results.
In particular, you need to select the following (a minimal request sketch follows this list):

- The data to use for backtesting:
  - Specify a dataset and an end time within the date range of the dataset. Backtesting uses labels and features based on complete calendar months up to, but not including, the month of the selected end time. For more information, see Dataset time ranges.
  - Specify how many months of labeled data to use for backtesting (that is, the number of backtest periods).
- A model created using a consistent dataset. See Configure an engine.
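For orientation only, here is a minimal sketch of those selections expressed as a Python dictionary. The field names (`dataset`, `model`, `endTime`, `backtestPeriods`) and the resource name formats are assumptions for illustration; use the exact request structure shown in Create and manage backtest results.

```python
# Minimal sketch of the selections needed for a backtest result.
# Field names and resource paths are assumed for illustration; see
# "Create and manage backtest results" for the actual request format.
backtest_result = {
    # Dataset to draw labeled backtest months from.
    "dataset": "projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID",
    # Model created using a consistent dataset.
    "model": "projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/models/MODEL_ID",
    # End time within the date range of the dataset; complete calendar months
    # before this month are used for backtesting.
    "endTime": "2023-04-03T23:21:00Z",
    # Number of complete calendar months of labeled data to evaluate.
    "backtestPeriods": 5,
}
```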
Backtest periods
The `backtestPeriods` field specifies how many consecutive calendar months of features and labels to use in the performance evaluation of this model.
The following apply to backtest data:
- The months used in evaluation are the most recent complete calendar months prior to the specified `endTime`. For example, if `endTime` is `2023-04-03T23:21:00Z` and `backtestPeriods` is `5`, then the labels from the following months are used: 2023-03, 2023-02, 2023-01, 2022-12, and 2022-11 (see the sketch after this list).
- Use the most recent available data for backtesting when evaluating a model in preparation for production use.
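The month arithmetic in the example above can be sanity-checked with a short Python sketch (illustrative only, not part of the AML AI API):

```python
from datetime import datetime

def backtest_months(end_time_iso: str, backtest_periods: int) -> list[str]:
    """Return the complete calendar months (YYYY-MM) used for backtest labels:
    the most recent complete months strictly before the month of end_time."""
    end = datetime.fromisoformat(end_time_iso.replace("Z", "+00:00"))
    year, month = end.year, end.month
    months = []
    for _ in range(backtest_periods):
        # Step back one calendar month at a time, starting with the month
        # immediately before the month containing end_time.
        month -= 1
        if month == 0:
            month, year = 12, year - 1
        months.append(f"{year:04d}-{month:02d}")
    return months

print(backtest_months("2023-04-03T23:21:00Z", 5))
# ['2023-03', '2023-02', '2023-01', '2022-12', '2022-11']
```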
Backtest periods must be set to `3` or greater. A minimum of 3 months is required so AML AI can account for repeat alerts when estimating investigations per period.

Avoid using overlapping months for training and backtesting, as this risks overfitting. Make sure that the backtest and training end times are at least `backtestPeriods` months apart. That is:

`(backtest results end time month) >= (model end time month) + backtestPeriods`
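The following sketch (illustrative only) expresses this non-overlap check in Python, comparing calendar months rather than exact timestamps:

```python
from datetime import datetime

def month_index(iso_time: str) -> int:
    """Convert a timestamp to an absolute month index (year * 12 + month)."""
    t = datetime.fromisoformat(iso_time.replace("Z", "+00:00"))
    return t.year * 12 + t.month

def no_overlap(backtest_end_time: str, model_end_time: str, backtest_periods: int) -> bool:
    """True when the backtest label months cannot overlap the training months, i.e.
    (backtest results end time month) >= (model end time month) + backtestPeriods."""
    return month_index(backtest_end_time) >= month_index(model_end_time) + backtest_periods

# Example: a model end time in 2022-11 and a backtest end time in 2023-04 are
# exactly backtestPeriods = 5 months apart, so the label months do not overlap.
print(no_overlap("2023-04-03T23:21:00Z", "2022-11-01T00:00:00Z", 5))  # True
```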
Optionally, you can also create prediction results for a model and conduct your own party-level analyses of model performance.
Backtest output
The backtest results metadata contains the following metrics. In particular, these metrics show you the following:

- How the model performs compared to labels from a separate time period and for a variety of different investigation volumes or risk score thresholds
- Any large changes in which feature families the dataset supports (between engine tuning, training, evaluation, and prediction)
| Metric name | Metric description | Example metric value |
|---|---|---|
| ObservedRecallValues | Recall metric measured on the dataset specified for backtesting. The API includes 20 of these measurements, at different operating points, evenly distributed from 0 (not included) up to 2 * `partyInvestigationsPerPeriodHint`. The API adds a final recall measurement at `partyInvestigationsPerPeriodHint`. | `{ "recallValues": [ { "partyInvestigationsPerPeriod": 5000, "recallValue": 0.80, "scoreThreshold": 0.42 }, ... { "partyInvestigationsPerPeriod": 8000, "recallValue": 0.85, "scoreThreshold": 0.30 } ] }` |
| Missingness | Share of missing values across all features in each feature family. Ideally, all AML AI feature families should have a Missingness near 0. Exceptions may occur where the data underlying those feature families is unavailable for integration. A significant change in this value for any feature family between tuning, training, evaluation, and prediction can indicate inconsistency in the datasets used. | `{ "featureFamilies": [ { "featureFamily": "unusual_wire_credit_activity", "missingnessValue": 0.00 }, ... { "featureFamily": "party_supplementary_data_id_3", "missingnessValue": 0.45 } ] }` |