Overview of model preparation

This page leads you through the steps to prepare an AML AI model, assuming you have already set up an instance and prepared the necessary datasets.

Overview of stages

The process to prepare a model covers the following three stages:

  • Configuring an engine (engine tuning)
  • Training a model
  • Evaluating the model (backtesting)

Once you have completed these stages and model performance meets your needs, see the guidance in Generate risk scores and explainability and in Prepare for model and risk governance.

Before you begin

Before you begin, you need the following:

  • An AML AI instance that you have already set up
  • Datasets prepared according to the guidance in Prepare Data for AML AI

Dataset requirements

For detailed guidance on the data model and schema, see the pages under Prepare Data for AML AI. This section covers how to make sure that the datasets used in engine tuning, training, and evaluation work well together.

Dataset time ranges

Each dataset used for tuning, training, backtesting, and prediction operations should contain valid data for a time range ending at the end of the last full calendar month before the end_time specified in the API call. The length of this time range depends on the table, the engine version, and the operation. The minimum time ranges are covered in detail in Understand data scope and duration.

For example, for engine tuning with v004.004 engine versions, the Transaction table should cover at least 30 months.
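As an illustration, the following Python sketch (not part of the AML AI tooling; the function names are made up for this example) computes the end of the valid time range and the earliest calendar month a table must cover for a given end_time and minimum duration, using the 30-month example above. Check Understand data scope and duration for the minimums that apply to your table, engine version, and operation.

```python
from datetime import date, datetime, timedelta, timezone


def last_full_month_end(end_time: datetime) -> date:
    """Last day of the last full calendar month before end_time."""
    first_of_month = end_time.date().replace(day=1)
    return first_of_month - timedelta(days=1)


def required_start_month(end_time: datetime, min_months: int) -> date:
    """First day of the earliest calendar month the dataset must cover."""
    range_end = last_full_month_end(end_time)
    # Walk back (min_months - 1) whole months from the month containing range_end.
    year, month = range_end.year, range_end.month - (min_months - 1)
    while month <= 0:
        month += 12
        year -= 1
    return date(year, month, 1)


# Example: an engine tuning call with end_time in early March 2024 and a
# 30-month minimum for the Transaction table.
end_time = datetime(2024, 3, 5, tzinfo=timezone.utc)
print(last_full_month_end(end_time))        # 2024-02-29
print(required_start_month(end_time, 30))   # 2021-09-01
```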

Configuring an engine, training, and evaluation (backtesting) can be completed with a single dataset; see the following image. To avoid overfitting and so ensure good production performance, make sure that the period used for evaluation (that is, creating backtest results) comes after the period used for training (that is, creating a model).

For example, if you use 3 periods for backtesting and train on periods up to the end of February 2024 (that is, an end time in early March 2024), then you could backtest on periods up to the end of May 2024 (that is, an end time in early June 2024).

Dataset time ranges for tuning, training, and backtesting
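To make the arithmetic in the example above concrete, here is a small Python sketch; the shift_months helper is illustrative only and simply moves a training end time forward by the number of backtest periods (months).

```python
import calendar
from datetime import datetime, timezone


def shift_months(dt: datetime, months: int) -> datetime:
    """Shift a datetime forward by whole calendar months, clamping the day if needed."""
    total = dt.year * 12 + (dt.month - 1) + months
    year, month = divmod(total, 12)
    month += 1
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)


# Training periods run up to the end of February 2024 (end time in early March 2024);
# adding 3 backtest periods gives an end time in early June 2024.
train_end = datetime(2024, 3, 5, tzinfo=timezone.utc)
print(shift_months(train_end, 3).date())  # 2024-06-05
```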

Dataset consistency

When using different datasets for the engine tuning, training, and evaluation stages, make the datasets consistent in which fields are populated and how they are populated. This is important for AML model stability and performance.

Similarly, for a high-quality risk score, the dataset used to create prediction results with a model should be consistent with the dataset used to train that model.

In particular, ensure the following (a consistency-check sketch follows this list):

  • The same logic is used to populate each field. Changing the logic used to populate a field can introduce feature skew between model training and prediction or evaluation.
  • The same selection of RECOMMENDED fields is populated. For example, removing a field that was populated during model training can cause features that the model relies on to be skewed or missing during evaluation or prediction.
  • The same logic is used to provide values for each party_supplementary_data_id field in the PartySupplementaryData table.

    • Using the same data, but with different party_supplementary_data_id values, causes the model to use data incorrectly. For example, a particular field uses ID 5 in the PartySupplementaryData table for one dataset, but then uses ID 7 in another dataset.
    • Removing a party_supplementary_data_id value that a model relies on may have unpredictable effects. For example, ID 3 is used in the PartySupplementaryData table in one dataset but is omitted from another dataset.
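The following Python sketch shows one way to run these consistency checks, assuming you have exported the PartySupplementaryData table from the training dataset and from the evaluation or prediction dataset into the hypothetical pandas DataFrames train_psd and other_psd.

```python
import pandas as pd


def populated_columns(df: pd.DataFrame) -> set:
    """Columns that contain at least one non-null value."""
    return {col for col in df.columns if df[col].notna().any()}


def check_party_supplementary_consistency(
    train_psd: pd.DataFrame, other_psd: pd.DataFrame
) -> None:
    """Compare populated fields and party_supplementary_data_id values between datasets."""
    train_cols, other_cols = populated_columns(train_psd), populated_columns(other_psd)
    if train_cols != other_cols:
        print("Populated fields differ:", train_cols ^ other_cols)

    train_ids = set(train_psd["party_supplementary_data_id"].unique())
    other_ids = set(other_psd["party_supplementary_data_id"].unique())
    if train_ids - other_ids:
        print("IDs the model was trained with but are now missing:", train_ids - other_ids)
    if other_ids - train_ids:
        print("IDs not seen during training:", other_ids - train_ids)
```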

Now you have a dataset ready for engine tuning, training, and evaluation. Note that model operations can take tens of hours. For information on how to check if an operation is still running or has completed (failed or succeeded), see Manage long-running operations.
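For example, the following Python sketch polls an operation until it finishes. It assumes the API follows the standard Google Cloud long-running operations pattern (an operation resource with a done field, plus an error or response field once complete); the operation name and polling interval are placeholders, and the host URL is an assumption to verify against the API reference.

```python
import time

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholder operation name; substitute your own project, location, and operation ID.
OPERATION_NAME = "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID"
URL = f"https://financialservices.googleapis.com/v1/{OPERATION_NAME}"

credentials, _ = google.auth.default()
session = AuthorizedSession(credentials)

while True:
    operation = session.get(URL).json()
    if operation.get("done"):
        # A finished operation carries either an "error" or a "response" field.
        status = "failed" if "error" in operation else "succeeded"
        print(f"Operation {status}:", operation.get("error") or operation.get("response"))
        break
    time.sleep(600)  # model operations can take tens of hours, so poll infrequently
```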