Overview of model preparation

This page leads you through the steps to prepare an AML AI model, assuming you have already set up an instance and prepared the necessary datasets.

Overview of stages

The process to prepare a model is covered in the following three stages: configuring an engine (engine tuning), training a model, and evaluating the model (backtesting).

Once you have completed the above stages and model performance meets your needs, see the guidance in sections Generate risk scores and explainability and Prepare for model and risk governance.

Before you begin

Before you begin, you will need an AML AI instance and datasets that meet the requirements described in the following section.

Dataset requirements

For detailed guidance on the data model and schema, see the pages under Prepare Data for AML AI. This section covers how to make sure that the datasets used in engine tuning, training, and evaluation work well together.

Dataset time ranges

The minimum time range of datasets for each operation is covered in Understand data scope and duration. In summary, a 0 to 24 month lookback window is required depending on the table, on top of a core time window of at least 18 months.

For example, for engine tuning, the Transaction table should cover at least 42 months (18 months core time window and 24 months for the lookback window).
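The arithmetic in this example can be sketched in Python. This is a minimal illustration and not part of AML AI: the helper names `add_months` and `required_range` and the example end date are assumptions, and the 24 month lookback is the Transaction table figure quoted above (other tables may need less).

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months (may be negative); the result is snapped to the 1st."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def required_range(core_end: date, core_months: int = 18, lookback_months: int = 24):
    """Return (data_start, core_start, core_end) for one table.

    core_end is treated as exclusive. The data must start lookback_months
    before the core time window begins.
    """
    core_start = add_months(core_end, -core_months)
    data_start = add_months(core_start, -lookback_months)
    return data_start, core_start, core_end


# Hypothetical example: a core time window that ends on 2024-07-01.
data_start, core_start, core_end = required_range(date(2024, 7, 1))
total_months = 18 + 24  # 42 months of Transaction data in total
print(data_start, core_start, core_end, total_months)
```

With these assumed inputs, the Transaction table would need to cover data from 2021-01-01 onward, with the core time window starting on 2023-01-01.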

Engine configuration, training, and evaluation (backtesting) can all be completed with a single dataset; see the following image. To avoid overfitting and ensure good production performance, use a core time window for evaluation (that is, creating backtest results) that is disjoint from, and more recent than, the core time window for training (that is, creating a model).

Figure: Dataset time ranges for tuning, training, and backtesting
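The rule that the evaluation core time window must be disjoint from, and more recent than, the training core time window can be checked before you start an operation. The following is a minimal sketch under stated assumptions: `Window` and `evaluation_window_is_valid` are hypothetical names, windows are half-open (start inclusive, end exclusive), and the dates are illustrative.

```python
from datetime import date
from typing import NamedTuple


class Window(NamedTuple):
    start: date  # inclusive
    end: date    # exclusive


def evaluation_window_is_valid(training: Window, evaluation: Window) -> bool:
    """True if the evaluation window starts at or after the end of the training window."""
    for w in (training, evaluation):
        if w.start >= w.end:
            raise ValueError(f"empty or inverted window: {w}")
    # Starting at or after the training end makes the windows both
    # disjoint and more recent.
    return evaluation.start >= training.end


# Hypothetical windows taken from a single dataset.
training = Window(date(2021, 1, 1), date(2022, 7, 1))
backtest = Window(date(2022, 7, 1), date(2024, 1, 1))
overlapping = Window(date(2022, 1, 1), date(2023, 7, 1))

print(evaluation_window_is_valid(training, backtest))
print(evaluation_window_is_valid(training, overlapping))
```

The first check passes because the backtest window begins where the training window ends; the second fails because the two windows share the first half of 2022.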

Dataset consistency

When using different datasets for the engine tuning, training, and evaluation stages, make the datasets consistent in which fields are populated and how they are populated. This is important for AML model stability and performance.

Similarly, for a high-quality risk score, the dataset used to create prediction results with a model should be consistent with the dataset used to train that model.

In particular, ensure the following:

  • The same logic is used to populate each field. Changing the logic used to populate a field can introduce feature skew between model training and prediction or evaluation.
  • The same selection of RECOMMENDED fields is populated. For example, removing a field that was populated during model training can cause features that the model relies on to be skewed or missing during evaluation or prediction.
  • In the PartySupplementaryData table, the same logic is used to provide values for each party_supplementary_data_id.

    • Using the same data, but with different party_supplementary_data_id values, causes the model to use data incorrectly. For example, a particular field uses ID 5 in the PartySupplementaryData table for one dataset, but then uses ID 7 in another dataset.
    • Removing a party_supplementary_data_id value that a model relies on may have unpredictable effects. For example, ID 3 is used in the PartySupplementaryData table in one dataset but is omitted from another dataset.
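Some of these consistency checks can be automated before you train or predict. The following is a minimal sketch, assuming each table has been loaded as a list of Python dictionaries; the function names and the `party_id` and `value` columns are illustrative assumptions, and only `party_supplementary_data_id` is taken from the text above.

```python
def populated_fields(rows):
    """Return the set of field names that have at least one non-null value."""
    return {name for row in rows for name, value in row.items() if value is not None}


def compare_datasets(train_rows, other_rows, id_field="party_supplementary_data_id"):
    """Report fields and supplementary data IDs that differ between two datasets."""
    train_fields = populated_fields(train_rows)
    other_fields = populated_fields(other_rows)
    train_ids = {r[id_field] for r in train_rows if r.get(id_field) is not None}
    other_ids = {r[id_field] for r in other_rows if r.get(id_field) is not None}
    return {
        "fields_missing": sorted(train_fields - other_fields),
        "fields_added": sorted(other_fields - train_fields),
        "ids_missing": sorted(train_ids - other_ids),
        "ids_added": sorted(other_ids - train_ids),
    }


# Hypothetical PartySupplementaryData rows for a training and a prediction dataset.
train_rows = [
    {"party_id": "p1", "party_supplementary_data_id": "3", "value": 0.2},
    {"party_id": "p1", "party_supplementary_data_id": "5", "value": 1.0},
]
predict_rows = [
    {"party_id": "p1", "party_supplementary_data_id": "7", "value": 1.0},
    {"party_id": "p1", "party_supplementary_data_id": "5", "value": None},
]

report = compare_datasets(train_rows, predict_rows)
print(report)
```

In this example the report flags ID 3 as missing, ID 7 as added, and `value` as no longer populated for every ID it once covered only if it is null in all rows. A set comparison like this cannot detect a change in the logic behind an ID that keeps the same number, so it complements a review of the population logic rather than replacing it.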

Now you have a dataset ready for engine tuning, training, and evaluation. Note that model operations can take tens of hours. For information on how to check if an operation is still running or has completed (failed or succeeded), see Manage long-running operations.