Gathering and merging the core data needed to run AML AI may be error-prone. To address this, AML AI has built-in data validation checks to provide you actionable feedback on how to address dataset-related issues.
The data validation checks are executed during two phases of the model deployment process, with any resulting errors included in the long-running operation (LRO) response.
- One check is executed as part of the creation of the dataset.
- Validation is also run at the beginning of other operations (tune, train, backtest, and predict).
The LRO
Status
contains one
ErrorInfo
message for each failed check. The ErrorInfo
message's reason
field contains
a stable string constant while other relevant information is provided in the
metadata
field. For more information on RPC errors, see
AIP-193.
Data validation checks and error output
Every validation failure contains the following corresponding error information:
Field | Description |
---|---|
reason |
Unique identifier for this type of error |
metadata["count"] |
Number of occurrences of this error |
metadata["data_table"]
|
The name of the input table in which the error occurred |
metadata["data_field"]
|
The name of the field in the input table in which the error occurred |
metadata["description"]
|
Detailed, actionable error description exposed in metadata |
metadata["test"]
|
Illustrative (pseudo-SQL) statement explaining the logic of the validation |
The following table lists all data validation checks performed by AML AI, their descriptions, and an example test response:
reason |
metadata["description"] |
metadata["test"] |
---|---|---|
NOT_NULL_COLUMN_WITH_NULLS |
One or more NOT NULL columns contain one or more null values. | X IS NULL |
DATE_TIME_DIFFERENCE |
The validity_start_time cannot include dates that are greater than today's date and validity_start_time must be greater than threshold. |
DATETIME_DIFF(CURRENT_TIMESTAMP(), validity_start_time, DAY) < 0 |
EXCESSIVE_ACCOUNTS_FOR_PARTY |
Number of accounts for the party exceeds the predefined threshold. | COUNT(DISTINCT account_id) > {{ var('overlarge_account_count')}} |
EXCESSIVE_PARTIES_FOR_SHARED_ACCOUNT |
Number of account holders for the account exceeds the predefined threshold. | COUNT(DISTINCT party_id) > {{ var('overlarge_account_holders')}} GROUP BY account_id |
MISSING_AML_EXIT_LABELS |
All, or no parties have AML exit events. Useful AML model labels cannot be created. | COUNT(party_id) WHERE type IN {{ positive_event_types}}) IN (0, count(party) |
DUPLICATE_PRIMARY_KEY |
There is a duplicate primary key value in the database resulting in a unique key violation. Note that for tables with validity_start_time , the primary key includes validity_start_time . |
GROUP BY X, validity_start_time HAVING count(1) > 1 |
NAN_VALUE_IN_FLOAT_COLUMN |
One or more columns contains a NaN (Not a Number) floating point value. | X is NAN |
INSUFFICIENT_DATE_RANGE |
The date range in the dataset specifies an insufficient number of months for any AML AI operation. The sufficient number of months for prediction is 24, and more for other operations. | COUNTIF((MAX({{transaction_time_column}}) - MIN({{transaction_time_column}})) > 3 years)/COUNT(*) < {{ var('short_timeframe_ratio') }} GROUP BY account_id |
EMPTY_TABLE |
One or more required tables in the database is empty. | COUNT(*) FROM X < 1 |
UNSUPPORTED_VALUE |
One or more columns include values that are not in the set of allowed values. | X NOT IN ("a1", "b2", "c3") |
DUPLICATE_RISK_CASE_EVENTS_TYPE_AML_EXIT |
Party was exited from the bank multiple times. AML_EXIT risk case event type is not allowed to occur multiple times. |
party_id, risk_case_id, countif(type = "AML_EXIT") > 1 |
DUPLICATE_RISK_CASE_EVENTS_TYPE_AML_PROCESS_START |
Multiple AML investigation processes were initiated against the party in this risk case. | party_id, risk_case_id, countif(type = "AML_PROCESS_START") > 1 |
DUPLICATE_RISK_CASE_EVENTS_TYPE_AML_PROCESS_END |
Multiple AML investigation processes were closed against the party in this risk case. | Party_id, risk_case_id countif(type = "AML_PROCESS_END") > 1 |
UNNORMALIZED_BOOKED_AMOUNT_CURRENCY |
Normalized booked amount includes multiple currencies in the Transaction table. All normalized amounts need to be in the same currency. | COUNT(DISTINCT normalized_booked_amount.currency_code) != 1 |
NEGATIVE_TRANSACTION_NORMALIZED_BOOKED_AMOUNT |
Normalized booked amount value for one or more transactions is negative. Data schema prohibits negative values for this field. | normalized_booked_amount.units < 0 OR normalized_booked_amount.nanos < 0 |
COLUMN_EXISTENCE |
One or more required columns don't exist in the database. | X DOES NOT EXIST IN {{table_name}} |
TABLE_EXISTENCE |
One or more required tables don't exist in the database. | TABLE {{table_name}} DOES NOT EXIST |
SUPPLEMENTARY_DATA_COLUMN_NAMES |
One or more party_supplementary_data_id values is not in the range of allowed values. IDs may use alphanumeric characters as well as underscores, and should start with an alphanumeric character. |
NOT EXISTS REGEXP_CONTAINS(party_supplementary_data_id, "^[a-zA-Z0-9][a-zA-Z0-9_]*$") AND supplementary_data_payload.value IS NOT NULL |
EXCESSIVE_PARTY_SUPPLEMENTARY_DATA_IDS |
Number of distinct party_supplementary_data_id values exceeds the maximum of 100. | COUNT(DISTINCT party_supplementary_data_id) > 100 |
MISSING_PARTY_SUPPLEMENTARY_DATA_ID |
One or more party supplementary data IDs that was present in the dataset used for model creation is missing in this dataset. | X NOT IN (DISTINCT {party_supplementary_data_ids used in model}) |
Accessing data validation errors
Data validation errors are included in the LRO response. See Manage long-running operations for more information. The platform logs also contain an entry with the LRO response.