Data validation errors

Gathering and merging the core data needed to run AML AI may be error-prone. To address this, AML AI has built-in data validation checks to provide you actionable feedback on how to address dataset-related issues.

The data validation checks are executed during two phases of the model deployment process, with any resulting errors included in the long-running operation (LRO) response.

  • One check is executed as part of the creation of the dataset.
  • Validation is also run at the beginning of other operations (tune, train, backtest, and predict).

The LRO Status contains one ErrorInfo message for each failed check. The ErrorInfo message's reason field contains a stable string constant while other relevant information is provided in the metadata field. For more information on RPC errors, see AIP-193.

Data validation checks and error output

Every validation failure contains the following corresponding error information:

Field Description
reason Unique identifier for this type of error
metadata["count"] Number of occurrences of this error
metadata["data_table"] The name of the input table in which the error occurred
metadata["data_field"] The name of the field in the input table in which the error occurred
metadata["description"] Detailed, actionable error description exposed in metadata
metadata["test"] Illustrative (pseudo-SQL) statement explaining the logic of the validation

The following table lists all data validation checks performed by AML AI, their descriptions, and an example test response:

reason metadata["description"] metadata["test"]
NOT_NULL_COLUMN_WITH_NULLS One or more NOT NULL columns contain one or more null values. X IS NULL
DATE_TIME_DIFFERENCE The validity_start_time cannot include dates that are greater than today's date and validity_start_time must be greater than threshold. DATETIME_DIFF(CURRENT_TIMESTAMP(), validity_start_time, DAY) < 0
EXCESSIVE_ACCOUNTS_FOR_PARTY Number of accounts for the party exceeds the predefined threshold. COUNT(DISTINCT account_id) > {{ var('overlarge_account_count')}}
EXCESSIVE_PARTIES_FOR_SHARED_ACCOUNT Number of account holders for the account exceeds the predefined threshold. COUNT(DISTINCT party_id) > {{ var('overlarge_account_holders')}} GROUP BY account_id
MISSING_AML_EXIT_LABELS All, or no parties have AML exit events. Useful AML model labels cannot be created. COUNT(party_id) WHERE type IN {{ positive_event_types}}) IN (0, count(party)
DUPLICATE_PRIMARY_KEY There is a duplicate primary key value in the database resulting in a unique key violation. Note that for tables with validity_start_time, the primary key includes validity_start_time. GROUP BY X, validity_start_time HAVING count(1) > 1
NAN_VALUE_IN_FLOAT_COLUMN One or more columns contains a NaN (Not a Number) floating point value. X is NAN
INSUFFICIENT_DATE_RANGE (Only up to v004.002) The date range in the dataset specifies an insufficient number of months for any AML AI operation. The sufficient number of months for prediction is 24, and more for other operations. COUNTIF((MAX({{transaction_time_column}}) - MIN({{transaction_time_column}})) > 3 years)/COUNT(*) < {{ var('short_timeframe_ratio') }} GROUP BY account_id
EMPTY_TABLE One or more required tables in the database is empty. COUNT(*) FROM X < 1
UNSUPPORTED_VALUE One or more columns include values that are not in the set of allowed values. X NOT IN ("a1", "b2", "c3")
DUPLICATE_RISK_CASE_EVENTS_TYPE_AML_EXIT Party was exited from the bank multiple times. AML_EXIT risk case event type is not allowed to occur multiple times. party_id, risk_case_id, countif(type = "AML_EXIT") > 1
DUPLICATE_RISK_CASE_EVENTS_TYPE_AML_PROCESS_START Multiple AML investigation processes were initiated against the party in this risk case. party_id, risk_case_id, countif(type = "AML_PROCESS_START") > 1
DUPLICATE_RISK_CASE_EVENTS_TYPE_AML_PROCESS_END Multiple AML investigation processes were closed against the party in this risk case. Party_id, risk_case_id countif(type = "AML_PROCESS_END") > 1
UNNORMALIZED_BOOKED_AMOUNT_CURRENCY Normalized booked amount includes multiple currencies in the Transaction table. All normalized amounts need to be in the same currency. COUNT(DISTINCT normalized_booked_amount.currency_code) != 1
NEGATIVE_TRANSACTION_NORMALIZED_BOOKED_AMOUNT Normalized booked amount value for one or more transactions is negative. Data schema prohibits negative values for this field. normalized_booked_amount.units < 0 OR normalized_booked_amount.nanos < 0
COLUMN_EXISTENCE One or more required columns don't exist in the database. X DOES NOT EXIST IN {{table_name}}
TABLE_EXISTENCE One or more required tables don't exist in the database. TABLE {{table_name}} DOES NOT EXIST
SUPPLEMENTARY_DATA_COLUMN_NAMES One or more party_supplementary_data_id values is not in the range of allowed values. IDs may use alphanumeric characters as well as underscores, and should start with an alphanumeric character. NOT EXISTS REGEXP_CONTAINS(party_supplementary_data_id, "^[a-zA-Z0-9][a-zA-Z0-9_]*$") AND supplementary_data_payload.value IS NOT NULL
EXCESSIVE_PARTY_SUPPLEMENTARY_DATA_IDS Number of distinct party_supplementary_data_id values exceeds the maximum of 100. COUNT(DISTINCT party_supplementary_data_id) > 100
MISSING_PARTY_SUPPLEMENTARY_DATA_ID One or more party supplementary data IDs that was present in the dataset used for model creation is missing in this dataset. X NOT IN (DISTINCT {party_supplementary_data_ids used in model})
AML_SUSPICIOUS_ACTIVITY_START_AFTER_AML_PROCESS_START AML suspicious activity starts after the AML process starts for at least one risk case and party. EXISTS(RCE1, RCE2) WHERE RCE1.party_id = RCE2.party_id AND RCE1.risk_case_id = RCE2.risk_case_id AND RCE1.type = "AML_PROCESS_START" AND RCE2.type = "AML_SUSPICIOUS_ACTIVITY_START" AND RCE1.event_time <= RCE2.event_time
EXCESSIVE_NUMBER_OF_PARTIES The dataset contains more parties than supported by this engine version. COUNT(DISTINCT party_id) >= THRESHOLD
EXCESSIVE_RISK_CASES_PER_PARTY Number of risk cases for at least one party exceeds the predefined threshold. COUNT(DISTINCT risk_case_id) > THRESHOLD GROUP BY party_id
INCONSISTENT_SUSPICIOUS_PERIOD Suspicious period for at least one party and risk case doesn't conform to requirements - suspicious activity start without corresponding end, or suspicious activity end without corresponding start. (party_id, risk_case_id, type = "AML_SUSPICIOUS_ACTIVITY_END") MUTEX (party_id, risk_case_id, type = "AML_SUSPICIOUS_ACTIVITY_START")
INCONSISTENT_SUSPICIOUS_PERIOD Suspicious period for at least one party and risk case doesn't conform to requirements - more than one suspicious activity start or end. COUNT(party_id, risk_case_id, type = "AML_SUSPICIOUS_ACTIVITY_END" OR "AML_SUSPICIOUS_ACTIVITY_START") > 1
INCONSISTENT_SUSPICIOUS_PERIOD Suspicious period for at least one party and risk case doesn't conform to requirements - suspicious activity period end before suspicious activity start. party_id, risk_case_id, type = "AML_SUSPICIOUS_ACTIVITY_END" event time < party_id, risk_case_id, type = "AML_SUSPICIOUS_ACTIVITY_START" event time
JOINABILITY_ISSUE_ACCOUNT_ID The account_id in the table cannot be found in the AccountPartyLink table. The account_id must exist in the AccountPartyLink table. account_id NOT IN (SELECT account_id FROM account_party_link)
JOINABILITY_ISSUE_ACCOUNT_ID_IS_DELETED The account_id is deleted from the AccountPartyLink table (without undeletion) at the validity start / event times in the other tables. account_id (FROM account_party_link) is_entity_deleted = TRUE WHEN account_id (FROM transaction) validity_start_time
JOINABILITY_ISSUE_MIN_ACCOUNT_ID_VALIDITY_START_TIME The minimum account_id validity start time in the AccountPartyLink table is later than the validity start time in Transactions table. account_id (FROM account_party_link) MIN validity_start_time > account_id (FROM transaction) validity_start_time
JOINABILITY_ISSUE_PARTY_ID The party_id in the table cannot be found in the Party table. The party_id must exist in the Party table. party_id NOT IN (SELECT party_id FROM party)
JOINABILITY_ISSUE_PARTY_ID_IS_DELETED The party_id is deleted from the Party table (without undeletion) at the validity start / event times in the other tables. party_id (FROM party) is_entity_deleted = TRUE WHEN party_id (FROM account_party_link) validity_start_time
NO_AML_PROCESS_START_FOR_RISK_CASE One or multiple positive risk cases that resulted in an exit or SAR don't include AML PROCESS START event. (party_id, risk_case_id COUNTIF(type = "AML_PROCESS_START") = 0) WHEN party_id, risk_case_id COUNTIF(type = "AML_EXIT" OR "AML_SAR" ) = 1)
UNSUPPORTED_DATA_TYPE Column present with incorrect data type. See AML AI Input Data model documentation for correct types. X DATA_TYPE IS NOT "expected_data_type"
MISSING_DATA_FOR_REQUIRED_TIME_RANGE (Starting from major engine version v004.007) There are one or multiple periods in the required time range without any valid entries in the Party, Transaction, or AccountPartyLink table. No valid entry in the required time range for some months.

Accessing data validation errors

Data validation errors are included in the LRO response. See Manage long-running operations for more information. The platform logs also contain an entry with the LRO response.