Data validation errors

Gathering and merging the core data needed to run AML AI can be error-prone. To address this, AML AI runs built-in data validation checks that give you actionable feedback on dataset-related issues.

The data validation checks are executed during two phases of the model deployment process, with any resulting errors included in the long-running operation (LRO) response.

  • One check is executed as part of the creation of the dataset.
  • Validation is also run at the beginning of other operations (tune, train, backtest, and predict).

The LRO Status contains one ErrorInfo message for each failed check. The ErrorInfo message's reason field contains a stable string constant while other relevant information is provided in the metadata field. For more information on RPC errors, see AIP-193.
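
As a minimal sketch of how the ErrorInfo messages could be pulled out of a finished operation, the following Python snippet walks the JSON form of an LRO. It assumes the standard google.rpc.Status JSON layout (an `error` object with a `details` list whose entries carry an `@type`). The function name `extract_error_infos`, the operation name, the status code, the message, and the metadata values in the sample payload are all illustrative assumptions, not output captured from AML AI.

```python
# Type URL that identifies a google.rpc.ErrorInfo entry in Status.details.
ERROR_INFO_TYPE = "type.googleapis.com/google.rpc.ErrorInfo"


def extract_error_infos(operation: dict) -> list[dict]:
    """Return every ErrorInfo detail in the operation's error, or [] if none."""
    error = operation.get("error") or {}
    return [
        detail
        for detail in error.get("details", [])
        if detail.get("@type") == ERROR_INFO_TYPE
    ]


# Hypothetical LRO payload: all values below are made up for illustration.
sample_operation = {
    "name": "operations/example-operation",
    "done": True,
    "error": {
        "code": 9,
        "message": "Data validation failed.",
        "details": [
            {
                "@type": ERROR_INFO_TYPE,
                "reason": "EMPTY_TABLE",
                "metadata": {"count": "1", "data_table": "party"},
            },
            {
                "@type": ERROR_INFO_TYPE,
                "reason": "NOT_NULL_COLUMN_WITH_NULLS",
                "metadata": {"count": "12", "data_table": "transaction"},
            },
        ],
    },
}

for info in extract_error_infos(sample_operation):
    # The reason is a stable string constant, so it is safe to branch on it.
    print(info["reason"], info.get("metadata", {}))
```

Because the reason field is stable, code like this can key its handling on the reason and treat the metadata as supporting detail.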

Data validation checks and error output

Every validation failure includes the following error information:

| Field | Description |
| --- | --- |
| reason | Unique identifier for this type of error |
| metadata["count"] | Number of occurrences of this error |
| metadata["data_table"] | The name of the input table in which the error occurred |
| metadata["data_field"] | The name of the field in the input table in which the error occurred |
| metadata["description"] | Detailed, actionable error description exposed in metadata |
| metadata["test"] | Illustrative (pseudo-SQL) statement explaining the logic of the validation |
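
The fields above can be mapped onto a small typed record, which makes failures easier to sort and report. The sketch below is one possible approach, not part of AML AI. It assumes that metadata values arrive as strings (ErrorInfo metadata is a string-to-string map) and that some keys may be absent for certain checks. The names `ValidationFailure`, `parse_failure`, and `summarize`, and the sample values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ValidationFailure:
    reason: str
    count: int
    data_table: str
    data_field: str
    description: str
    test: str


def parse_failure(error_info: dict) -> ValidationFailure:
    """Build a ValidationFailure from one ErrorInfo dict, tolerating missing keys."""
    meta = error_info.get("metadata", {})
    raw_count = str(meta.get("count", ""))
    return ValidationFailure(
        reason=error_info.get("reason", ""),
        # Metadata values are strings, so the count is converted explicitly.
        count=int(raw_count) if raw_count.isdigit() else 0,
        data_table=meta.get("data_table", ""),
        data_field=meta.get("data_field", ""),
        description=meta.get("description", ""),
        test=meta.get("test", ""),
    )


def summarize(failure: ValidationFailure) -> str:
    """One-line summary: reason, occurrence count, and table[.field] location."""
    location = failure.data_table
    if failure.data_field:
        location = f"{location}.{failure.data_field}"
    return f"{failure.reason}: {failure.count} occurrence(s) in {location}"


# Hypothetical ErrorInfo entry; the values are illustrative only.
example = {
    "reason": "NOT_NULL_COLUMN_WITH_NULLS",
    "metadata": {
        "count": "12",
        "data_table": "transaction",
        "data_field": "account_id",
        "description": "One or more NOT NULL columns contain one or more null values.",
        "test": "X IS NULL",
    },
}

print(summarize(parse_failure(example)))
```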

The following table lists all data validation checks performed by AML AI, their descriptions, and an example test response:

| reason | metadata["description"] | metadata["test"] |
| --- | --- | --- |
| NOT_NULL_COLUMN_WITH_NULLS | One or more NOT NULL columns contain one or more null values. | X IS NULL |
| DATE_TIME_DIFFERENCE | The validity_start_time cannot be later than today's date, and validity_start_time must be greater than the threshold. | DATETIME_DIFF(CURRENT_TIMESTAMP(), validity_start_time, DAY) < 0 |
| EXCESSIVE_ACCOUNTS_FOR_PARTY | Number of accounts for the party exceeds the predefined threshold. | COUNT(DISTINCT account_id) > {{ var('overlarge_account_count')}} |
| EXCESSIVE_PARTIES_FOR_SHARED_ACCOUNT | Number of account holders for the account exceeds the predefined threshold. | COUNT(DISTINCT party_id) > {{ var('overlarge_account_holders')}} GROUP BY account_id |
| MISSING_AML_EXIT_LABELS | Either all parties or no parties have AML exit events, so useful AML model labels cannot be created. | (COUNT(party_id) WHERE type IN {{ positive_event_types}}) IN (0, count(party)) |
| DUPLICATE_PRIMARY_KEY | There is a duplicate primary key value in the database resulting in a unique key violation. Note that for tables with validity_start_time, the primary key includes validity_start_time. | GROUP BY X, validity_start_time HAVING count(1) > 1 |
| NAN_VALUE_IN_FLOAT_COLUMN | One or more risk case event scores are not numbers. Scores for risk case events can only be numbers. | IS_NAN(score) |
| INSUFFICIENT_DATE_RANGE | The date range in the dataset specifies an insufficient number of months for any AML AI operation. The sufficient number of months for prediction is 24, and more for other operations. | COUNTIF((MAX({{transaction_time_column}}) - MIN({{transaction_time_column}})) > 3 years)/COUNT(*) < {{ var('short_timeframe_ratio') }} GROUP BY account_id |
| EMPTY_TABLE | One or more required tables in the database are empty. | COUNT(*) FROM X < 1 |
| UNSUPPORTED_VALUE | One or more columns include values that are not in the set of allowed values. | X NOT IN ("a1", "b2", "c3") |
| DUPLICATE_RISK_CASE_EVENTS_TYPE_AML_EXIT | Party was exited from the bank multiple times. AML_EXIT risk case event type is not allowed to occur multiple times. | party_id, risk_case_id, countif(type = "AML_EXIT") > 1 |
| DUPLICATE_RISK_CASE_EVENTS_TYPE_AML_PROCESS_START | Multiple AML investigation processes were initiated against the party in this risk case. | party_id, risk_case_id, countif(type = "AML_PROCESS_START") > 1 |
| DUPLICATE_RISK_CASE_EVENTS_TYPE_AML_PROCESS_END | Multiple AML investigation processes were closed against the party in this risk case. | party_id, risk_case_id, countif(type = "AML_PROCESS_END") > 1 |
| UNNORMALIZED_BOOKED_AMOUNT_CURRENCY | Normalized booked amount includes multiple currencies in the Transaction table. All normalized amounts need to be in the same currency. | COUNT(DISTINCT normalized_booked_amount.currency_code) != 1 |
| NEGATIVE_TRANSACTION_INITIATED_AMOUNT | Initiated amount value for the transaction is negative. Data schema prohibits negative values for this field. | initiated_amount.units < 0 OR initiated_amount.nanos < 0 |
| NEGATIVE_TRANSACTION_NORMALIZED_BOOKED_AMOUNT | Normalized booked amount value for one or more transactions is negative. Data schema prohibits negative values for this field. | normalized_booked_amount.units < 0 OR normalized_booked_amount.nanos < 0 |
| COLUMN_EXISTENCE | One or more required columns do not exist in the database. | X DOES NOT EXIST IN {{table_name}} |
| TABLE_EXISTENCE | One or more required tables do not exist in the database. | TABLE {{table_name}} DOES NOT EXIST |
| SUPPLEMENTARY_DATA_COLUMN_NAMES | One or more party_supplementary_data_id values are not in the range of allowed values. IDs may use alphanumeric characters as well as underscores, and should start with an alphanumeric character. | NOT EXISTS REGEXP_CONTAINS(party_supplementary_data_id, "^[a-zA-Z0-9][a-zA-Z0-9_]*$") AND supplementary_data_payload.value IS NOT NULL |
| EXCESSIVE_PARTY_SUPPLEMENTARY_DATA_IDS | Number of distinct party_supplementary_data_id values exceeds the maximum of 100. | COUNT(DISTINCT party_supplementary_data_id) > 100 |
| MISSING_PARTY_SUPPLEMENTARY_DATA_ID | One or more party supplementary data IDs that were present in the dataset used for model creation are missing in this dataset. | X NOT IN (DISTINCT {party_supplementary_data_ids used in model}) |
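
Some of these checks are simple enough to approximate locally before you create a dataset. The following sketch mirrors the pseudo-SQL of three checks (NOT_NULL_COLUMN_WITH_NULLS, DUPLICATE_PRIMARY_KEY, and NAN_VALUE_IN_FLOAT_COLUMN) over in-memory rows. It is an approximation written for illustration, not AML AI's implementation. The helper names, the `precheck` function, and the column names in the sample rows are assumptions, and the counts it returns may not match how AML AI computes metadata["count"].

```python
import math
from collections import Counter


def count_nulls(rows: list[dict], column: str) -> int:
    """Approximates `X IS NULL`: rows where the column is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def count_duplicate_keys(rows: list[dict], key_columns: list[str]) -> int:
    """Approximates `GROUP BY key HAVING count(1) > 1`: keys seen more than once."""
    seen = Counter(tuple(row.get(c) for c in key_columns) for row in rows)
    return sum(1 for n in seen.values() if n > 1)


def count_nan(rows: list[dict], column: str) -> int:
    """Approximates `IS_NAN(score)`: rows whose float value is NaN."""
    return sum(
        1
        for row in rows
        if isinstance(row.get(column), float) and math.isnan(row[column])
    )


def precheck(rows, not_null, key_columns, float_column) -> dict:
    """Return {reason: count} for each approximated check that found a problem."""
    results = {
        "NOT_NULL_COLUMN_WITH_NULLS": sum(count_nulls(rows, c) for c in not_null),
        "DUPLICATE_PRIMARY_KEY": count_duplicate_keys(rows, key_columns),
        "NAN_VALUE_IN_FLOAT_COLUMN": count_nan(rows, float_column),
    }
    return {reason: n for reason, n in results.items() if n > 0}


# Made-up rows with illustrative column names, each triggering one check.
rows = [
    {"event_id": "e1", "party_id": "p1", "score": 0.4},
    {"event_id": "e1", "party_id": "p1", "score": float("nan")},
    {"event_id": "e2", "party_id": None, "score": 0.9},
]

problems = precheck(
    rows,
    not_null=["event_id", "party_id"],
    key_columns=["event_id"],
    float_column="score",
)
print(problems)
```

A local pass like this only catches a subset of issues. The checks run by AML AI during dataset creation and at the start of each operation remain the source of truth.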

Accessing data validation errors

Data validation errors are included in the LRO response. See Manage long-running operations for more information. The platform logs also contain an entry with the LRO response.