Generate a model

This page briefly covers concepts behind model training. An AML AI model resource represents a trained model that can be used to generate risk scores and explainability.

When to train or re-train

AML AI trains a model as part of creating a Model resource. The model must be trained before it can be evaluated (that is, backtested) or used to generate prediction results.

For best performance and the most up-to-date models, consider re-training monthly. Note that a given engine version supports generating prediction results only for 12 months after the release of a newer minor engine version.

How to train

To train a model (that is, create a model), see Create and manage models.

In particular, you need to select the following:

  • The data to use for training:

    Specify a dataset and an end time within the date range of the dataset.

    Training uses labels and features based on complete calendar months up to, but not including, the month of the selected end time. For more information, see Dataset time ranges.

  • An engine config created using a consistent dataset:

    See Configure an engine.
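The calendar-month rule above can be sketched as follows. This is a minimal illustration; `training_cutoff` is our own helper name, not part of the AML AI API:

```python
from datetime import datetime

def training_cutoff(end_time: datetime) -> datetime:
    """Exclusive upper bound of the training data: the first instant of
    the month containing end_time, so that only complete calendar
    months before that month contribute labels and features."""
    return end_time.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

# An end time of 2023-08-15 means training uses data up to,
# but not including, 2023-08-01.
cutoff = training_cutoff(datetime(2023, 8, 15))
print(cutoff.isoformat())  # 2023-08-01T00:00:00
```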

Training output

Training generates a Model resource, which can be used to do the following:

  • Create backtest results, which are used to evaluate model performance using currently known true positives
  • Create prediction results, which are used once you are ready to start reviewing new cases for potential money laundering

The model metadata contains the missingness metric, which can be used to assess dataset consistency (for example, by comparing the missingness values of feature families from different operations).

Metric name: Missingness

Metric description: Share of missing values across all features in each feature family.

Ideally, all AML AI feature families should have a missingness near 0. Exceptions may occur where the data underlying those feature families is unavailable for integration.

A significant change in this value for any feature family between tuning, training, evaluation, and prediction can indicate inconsistency in the datasets used.

Example metric value:
{
  "featureFamilies": [
    {
      "featureFamily": "unusual_wire_credit_activity",
      "missingnessValue": 0.00
    },
    ...
    {
      "featureFamily": "party_supplementary_data_id_3",
      "missingnessValue": 0.45
    }
  ]
}
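A consistency check of this kind can be sketched as follows. The function name and threshold are our own, not part of the AML AI API; the payload shape follows the example metric value:

```python
def missingness_drift(meta_a: dict, meta_b: dict, threshold: float = 0.1) -> list:
    """Return feature families whose missingness differs by more than
    `threshold` between two metadata payloads, e.g. training versus
    prediction."""
    a = {f["featureFamily"]: f["missingnessValue"]
         for f in meta_a["featureFamilies"]}
    return [f["featureFamily"]
            for f in meta_b["featureFamilies"]
            if f["featureFamily"] in a
            and abs(f["missingnessValue"] - a[f["featureFamily"]]) > threshold]

train = {"featureFamilies": [
    {"featureFamily": "unusual_wire_credit_activity", "missingnessValue": 0.00},
    {"featureFamily": "party_supplementary_data_id_3", "missingnessValue": 0.45},
]}
predict = {"featureFamilies": [
    {"featureFamily": "unusual_wire_credit_activity", "missingnessValue": 0.02},
    {"featureFamily": "party_supplementary_data_id_3", "missingnessValue": 0.80},
]}
print(missingness_drift(train, predict))  # ['party_supplementary_data_id_3']
```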
Metric name: Importance

Metric description: A metric that shows the importance of a feature family to the model. Higher values indicate more significant use of the feature family in the model. A feature family that is not used in the model has zero importance.

Importance values can be used when prioritizing action on feature family skew results. For example, given the same skew value, a family with higher importance to the model is more urgent to resolve.

Example metric value:

{
  "featureFamilies": [
    {
      "featureFamily": "unusual_wire_credit_activity",
      "importanceValue": 459761000000
    },
    ...
    {
      "featureFamily": "party_supplementary_data_id_3",
      "importanceValue": 27492
    }
  ]
}
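A triage order for skewed families can be derived from the importance metadata, as sketched below. This is an illustration only; `skewed_families` is an invented input, not an AML AI output:

```python
def prioritize_by_importance(model_meta: dict, skewed_families: list) -> list:
    """Sort skewed feature families so that those with higher importance
    to the model come first, since equal skew on a more important
    family is more urgent to resolve."""
    importance = {f["featureFamily"]: f["importanceValue"]
                  for f in model_meta["featureFamilies"]}
    return sorted(skewed_families,
                  key=lambda fam: importance.get(fam, 0),
                  reverse=True)

meta = {"featureFamilies": [
    {"featureFamily": "unusual_wire_credit_activity", "importanceValue": 459761000000},
    {"featureFamily": "party_supplementary_data_id_3", "importanceValue": 27492},
]}
print(prioritize_by_importance(
    meta, ["party_supplementary_data_id_3", "unusual_wire_credit_activity"]))
# ['unusual_wire_credit_activity', 'party_supplementary_data_id_3']
```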

Model metadata does not contain recall metrics from a test set. To generate recall measurements for a specific time period (for example, the test set), see Evaluate a model.