AML input data model

This page describes the AML input data model and its requirements. To see the AML schema, see the tables starting in Core banking data tables. The full AML schema is available in the AML input data model CSV file.

Data model requirements

The following sections describe the table-level requirements and field-level requirements of the data model.

Table-level requirements

Table classification indicates when to include the table in a dataset:

  • MANDATORY: This table must always be included and populated.
  • RECOMMENDED: This table is optional. See the table description for information on how and when to use it.

Field-level requirements

Detailed descriptions for each field are provided in the corresponding section of this document.

  • MANDATORY: This field must always be included.
    • AML AI relies on completeness and correctness of these fields for basic functioning and performance. If you don't correctly complete these fields, you may receive data validation errors or the product may function incorrectly.
    • If a field is NULLABLE, the field description specifies when NULL values are accepted.
  • RECOMMENDED: This field should be included for best AML performance and AML typology coverage, or other use cases.
    • It is strongly recommended that you include these fields for optimal performance and risk typology coverage. However, AML AI is designed to work correctly in the absence of this data.
    • If you don't populate a RECOMMENDED field, you may omit it from your BigQuery dataset.

To help you prioritize which RECOMMENDED fields you should invest the time to populate, see Select data for best performance and risk typology coverage.

Other fields should not be included in your BigQuery dataset.

Completeness of data

The following applies to completeness of data when populating a field:

  • Enums:
    • Enums are represented in the schema as string fields, but with a limited set of supported values listed in the description (for example, Party.type supports "One of: [COMPANY:CONSUMER]").
    • By default, you should not use values not listed in the enum, NULL, or UNKNOWN.
    • Where the value UNKNOWN is supported, it is be explicitly specified in the field description as a supported value and should be used when your systems do not have the required information.
  • Strings:
    • By default, you should not use the empty string, NULL, or your own internal default values.
    • For fields where a default value is supported, the value is clearly stated in the field description.
  • Repeated fields:
    • By default, repeated fields should contain at least one item.
    • When a repeated field may contain no items, it is explicitly stated.

The mode (for example, NULLABLE, REQUIRED, or REPEATED) indicates the values the field will technically accept; see Modes for more information. However, see the field description for more precise guidance.

Understanding how data changes over time

AML AI needs to be able to recreate views of the data at different points in time for tuning, training, and backtesting. To achieve this, AML AI differentiates between mutable entities (that is, entities that can change values over time) and immutable entities. Immutable entities, such as events, do not substantially change after they are created or occur.

Mutable entities

Examples of mutable entities include a party moving abroad (updating Party.residencies) and a transaction progressing from initiation to booked (updating Transaction.book_time).

The validity_start_time field

Mutable entities change their state over time due to real world events. Every time an entity changes, a new row is added to the table with the time of the change captured in the validity_start_time field.

validity_start_time is part of the primary key for any table of mutable entities. See the best practices for setting this field.

The is_entity_deleted field

Sometimes, there is no longer valid data for a mutable entity. To indicate this state, ingest a row with is_entity_deleted set to TRUE. All non-key fields should be left unset or NULL.

Examples of this situation include the following:

  • A party stops being associated with an account. In this case, the corresponding AccountPartyLink entry is no longer valid.
  • An entity is deleted from the corresponding system of record, but you are able to include values for that entity in your dataset prior to deletion.

In most cases, you will use the value NULL (or equivalently FALSE) for this field.

Immutable entities (including events)

The RiskCaseEvent.event_time is an example of an immutable entity. event_time indicates when the event occurred. After this time, you do not expect that the event details will change.

For these entities, there are neither validity_start_time nor is_entity_deleted fields. AML AI does not expect the event to change over time.

Best practices in setting the validity_start_time field

Validity start time is used by AML AI to construct a view of what was known by the bank at a given point in time. This allows the system to accurately train models that can be reused on the latest data (for example, what is currently known by the bank) to produce high fidelity risk scores. For this reason, validity_start_time for a given row represents the earliest time that the data in this row was both known to the bank and known to be correct.

The following best practices clarify how to select validity_start_time in different situations:

  • Example 1: Party A moves abroad and communicates to the bank their new address and the change to their tax residency

    • Case 1: your systems only record the date that this information was received

      Best practice: use this date for validity_start_time

    • Case 2: your systems only record the date that Party A actually changed their tax residency, but not the date that this information was received

      Best practice: use the date of the move, however, not knowing when the bank received this information introduces some risk of leakage and overfitting in AML models

    • Case 3: your systems record both the date that Party A changed their tax residency and the date that this information was received

      Best practice: use the latter of the two dates, as long as there was no further information from the party by that point to invalidate this data (for example, they told you that they had postponed the change by one week)

  • Example 2: Party A arranges on 2023-01-02 in the bank branch to add Party B as joint account holder. The change becomes effective in your systems on 2023-02-15.

    Best practice: AccountPartyLink.validity_start_time is 2023-02-15 since this is the date that the change was made and is recorded in your systems

Joinability

To make sure that a whole dataset can be used together, take note of the following:

  • Foreign keys used in any table should always match to rows in the corresponding table where they are (part of) the primary key. For example, when a party_id is used in an AccountPartyLink entity, that party_id should appear in a Party entity.
  • The same should work when taking into account validity_start_time and is_entity_deleted. Joins should not fail based on valid data at any point in time. For example, when a party_id is used in an AccountPartyLink entity with validity_start_time set to 8:00 AM on 2022-05-04, that party_id should appear in a Party entity with at least one row with the same or earlier value for validity_start_time.

Data validation

AML AI performs limited checks to validate that data provided to it is sufficiently complete and correct for AML AI to function. These checks are conducted during the creation of a dataset, engine config, model, prediction results, and backtest results. Failure of the tests results in a failure of the corresponding API call with errors visible in the logs.

See Data validation errors for information on how to fix these errors.

Core banking data tables

Core banking data helps you to understand your customers and their banking activity in order to detect risky characteristics and behaviors.

Party

The Party table covers your retail or commercial banking customers.

  • Type: MANDATORY table. AML AI requires that this table is populated.
  • Purpose for AML: AML AI takes a customer-centric approach to risk detection. The Party table allows AML AI to understand important details of each Party, and how they changed over time.
  • Scope: include a full history of your party data for the core time window
    • Retail LoB: include all retail banking customers that have held accounts at any point in the core time window
    • Commercial LoB: all commercial banking customers (legal and natural entities) that have held accounts at any point in the core time window
  • Primary key columns: party_id, validity_start_time

Do not include additional parties beyond those described in the preceding section regarding scope. In particular, AML AI does not support the inclusion of all retail and commercial banking customers in the same dataset.

ColumnTypeModeDescription
party_idSTRINGREQUIREDMANDATORY: Unique ID for your customers. Use a pseudo-ID or an internal customer ID instead of an externally facing customer ID.
validity_start_timeTIMESTAMPREQUIREDMANDATORY: Timestamp of when the information in this row reflected the state of the entity and was known to the bank.
is_entity_deletedBOOLNULLABLERECOMMENDED: If set to TRUE, indicates the party as deleted from your system of reference.
source_systemSTRINGNULLABLERECOMMENDED: Name of the system this row was fetched from.
typeSTRINGREQUIREDMANDATORY: Type of this party, to differentiate between a natural person or a legal entity. One of: [COMPANY:CONSUMER].
birth_dateDATENULLABLERECOMMENDED: Date of birth for the Party for natural persons, for example, where Party.type = CONSUMER. Where Party.type = COMPANY, use NULL. Typically also used for fairness evaluation.
establishment_dateDATENULLABLERECOMMENDED: Party's date of establishment with Party.type = COMPANY. Where Party.type = CONSUMER, use NULL.
occupationSTRINGNULLABLERECOMMENDED: Most recent occupation of the Party, if known. Only relevant with Party.type = CONSUMER. Where Party.type = COMPANY, use NULL. If not known, use the empty string. Note that this field is not intended to capture employment level. For example, for a carpenter who is currently unemployed, this field should read "Carpenter".
genderSTRINGNULLABLERECOMMENDED: Gender string for individuals. Typically used for fairness evaluation.
nationalitiesSTRUCTREPEATEDRECOMMENDED: One or more nationalities for natural persons, with an empty list in exceptional cases when your do not have information on the tax residency of the party. Empty for companies. Typically also used for fairness evaluation.
nationalities.region_codeSTRINGREQUIREDMANDATORY: Country or region in two-letter Unicode CLDR code format.
residenciesSTRUCTREPEATEDRECOMMENDED: One or more of the Party's tax residencies (for natural persons), with an empty list in exceptional cases. Empty list for companies.
residencies.region_codeSTRINGREQUIREDMANDATORY: Country or region in two-letter Unicode CLDR code format.
exit_dateDATENULLABLEMANDATORY: The date when the party stopped being a customer. Use NULL if the Party is still a customer, or never was a customer.
join_dateDATENULLABLEMANDATORY: Date when the Party first became your customer. Use NULL if the PARTY never was a customer.
assets_value_rangeSTRUCTNULLABLERECOMMENDED: Approximate value of assets held by the Party. Where possible, this estimate should include assets held by other financial institutions. Include at least the start or end amount.
assets_value_range.start_amountSTRUCTNULLABLERECOMMENDED: The lower bound of the money range (inclusive) or NULL if there is no bound. If provided, it should be positive.
assets_value_range.start_amount.currency_codeSTRINGREQUIREDMANDATORY: The three-letter currency code defined in ISO 4217.
assets_value_range.start_amount.unitsINT64REQUIREDMANDATORY: The whole units of the amount. For example, if currency_code is USD, then 1 unit is one US dollar.
assets_value_range.start_amount.nanosINT64REQUIREDMANDATORY: Number of nano (10^-9) units of the amount. The value must be between -999,999,999 and +999,999,999, inclusive. If units is positive, nanos must be positive or zero. If units is zero, nanos can be positive, zero, or negative. If units is negative, nanos must be negative or zero. For example, $-1.75 is represented as units=-1 and nanos=-750,000,000.
assets_value_range.end_amountSTRUCTNULLABLERECOMMENDED: The upper bound of the money range (exclusive) or NULL if there is no bound. If provided, it should be positive. The range represents a single value when start_amount is equal to end_amount.
assets_value_range.end_amount.currency_codeSTRINGREQUIREDMANDATORY: The three-letter currency code defined in ISO 4217.
assets_value_range.end_amount.unitsINT64REQUIREDMANDATORY: The whole units of the amount. For example, if currency_code is USD, then 1 unit is one US dollar.
assets_value_range.end_amount.nanosINT64REQUIREDMANDATORY: Number of nano (10^-9) units of the amount. The value must be between -999,999,999 and +999,999,999, inclusive. If units is positive, nanos must be positive or zero. If units is zero, nanos can be positive, zero, or negative. If units is negative, nanos must be negative or zero. For example, $-1.75 is represented as units=-1 and nanos=-750,000,000.
civil_status_codeSTRINGNULLABLERECOMMENDED: Civil status of the Party, based on ISO 20022 Civil Status Codes. If not known, use UNKNOWN. Supported values: SINGLE, MARRIED, DIVORCED, WIDOW, STABLE_UNION, SEPARATED, UNKNOWN One of: [SINGLE:LEGALLY_DIVORCED:DIVORCED:WIDOW:STABLE_UNION:SEPARATED:UNKNOWN].
education_level_codeSTRINGNULLABLERECOMMENDED: Most recent educational level of the Party, defined in International Standard Classification of Education (ISCED-11). If not known, use UNKNOWN. Based on standard ISCED-Attainment (ISCED-A) coding. One of: [LESS_THAN_PRIMARY_EDUCATION:PRIMARY_EDUCATION:LOWER_SECONDARY_EDUCATION:UPPER_SECONDARY_EDUCATION:POST_SECONDARY_NON_TERTIARY_EDUCATION:SHORT_CYCLE_TERTIARY_EDUCATION:BACHELORS_OR_EQUIVALENT:MASTERS_OR_EQUIVALENT:DOCTORAL_OR_EQUIVALENT:NOT_ELSEWHERE_CLASSIFIED:UNKNOWN].

The AccountPartyLink table represents how parties are related to accounts. This includes all account holders beyond just the primary account holders.

  • Type: MANDATORY table. AML AI requires that this table is populated.
  • Purpose for AML: shows which parties hold which accounts, and which parties have joint holdings with other parties
  • Scope:
    • All parties in the Party table
    • All accounts for products and services when a party in the Party table is (or was) the primary account holder during the dataset's core time window, plus a 24-month lookback window
  • Primary key columns: account_id, party_id, validity_start_time
ColumnTypeModeDescription
account_idSTRINGREQUIREDMANDATORY: Unique ID for this Account.
party_idSTRINGREQUIREDMANDATORY: Unique ID for this Party in the Party table.
validity_start_timeTIMESTAMPREQUIREDMANDATORY: Timestamp when the information of this row was in the correct state for the entity.
is_entity_deletedBOOLNULLABLERECOMMENDED: If set to TRUE, indicates the link is no longer present.
roleSTRINGNULLABLEMANDATORY: Describes how the party is related to the account. In particular, capturing primary and secondary account holders. Each Account should have at least one PRIMARY_HOLDER at any given time. Supported values: PRIMARY_HOLDER, SECONDARY_HOLDER, SUPPLEMENTARY_HOLDER. Do not use NULL for this field. One of: [PRIMARY_HOLDER:SECONDARY_HOLDER:SUPPLEMENTARY_HOLDER].
source_systemSTRINGNULLABLERECOMMENDED: Name of the system this row was fetched from.

Transaction

A transaction represents a movement of funds from or to an account in the AccountPartyLink table.

  • Type: MANDATORY table. AML AI requires that this table is populated.
  • Purpose for AML: helps the models understand the flow of money
  • Scope:
    • All transactions into, out of, or between accounts in the AccountPartyLink table should be captured in the Transaction table, including the core time window plus 24-month lookback window
      • Transactions between two accounts in the AccountPartyLink table require a pair of rows in the Transaction table with separate values for the transaction_id field, one with a direction of DEBIT and one with a direction of CREDIT
    • Includes cards and other payments, wire transfers, and movements of funds between accounts of the same party
  • Primary key columns: transaction_id, validity_start_time
ColumnTypeModeDescription
transaction_idSTRINGREQUIREDMANDATORY: Unique ID of this transaction, from the point of view of the specified Account. Note that a transfer is between two accounts, both in the Account table, and should be represented by two transactions with separate transaction IDs, one with direction = DEBIT and one with direction = CREDIT.
validity_start_timeTIMESTAMPREQUIREDMANDATORY: Timestamp when the information of this row was the correct state of the entity.
is_entity_deletedBOOLNULLABLERECOMMENDED: If set to TRUE, indicates the Transaction is no longer present in your system of record.
source_systemSTRINGNULLABLERECOMMENDED: Name of the system that this row was fetched from.
typeSTRINGREQUIREDMANDATORY: High level type of the transaction. Supported values: CARD, transaction involving a credit or debit card (includes GPay); CASH, transaction where cash is paid into or withdrawn from an account; CHECK, transaction where a cheque is used; WIRE, default for other transactions, including transfers between accounts at the same or different banks. One of: [WIRE:CASH:CHECK:CARD:OTHER].
directionSTRINGREQUIREDMANDATORY: Direction of the transaction assets flow from the point of view of the specified account ID. Uses common definition of credit/debit in banking. Supported values: DEBIT, represents a transfer of value from this account; CREDIT, represents a transfer of value to this account One of: [DEBIT:CREDIT].
account_idSTRINGREQUIREDMANDATORY: Account ID of the Account in the Account Party Link table.
counterparty_accountSTRUCTNULLABLEMANDATORY: Details of counterparty and their account used in this transaction. Relates specifically to WIRE, CASH and CHECK transactions. For CARD transactions, this may represent the merchant or be NULL in case you do not have these details.
counterparty_account.account_idSTRINGNULLABLEMANDATORY: Account ID of a transaction's counterparty, if in the AccountPartyLink table. Can be NULL if the counterparty account is not represented in the Account Party Link table, for example, the customer of another financial institution.
book_timeTIMESTAMPNULLABLEMANDATORY: Time when the transaction was booked. Use NULL if a book time is not yet known.
normalized_booked_amountSTRUCTNULLABLEMANDATORY: Non-negative booked amount posted on the party's account, normalized to a standard currency for the whole dataset.
normalized_booked_amount.currency_codeSTRINGREQUIREDMANDATORY: The three-letter currency code defined in ISO 4217.
normalized_booked_amount.unitsINT64REQUIREDMANDATORY: The whole units of the amount. For example, if currency_code is USD, then 1 unit is one US dollar.
normalized_booked_amount.nanosINT64REQUIREDMANDATORY: Number of nano (10^-9) units of the amount. The value must be between -999,999,999 and +999,999,999, inclusive. If units is positive, nanos must be positive or zero. If units is zero, nanos can be positive, zero, or negative. If units is negative, nanos must be negative or zero. For example, $-1.75 is represented as units=-1 and nanos=-750,000,000.

Risk investigation data tables

A risk case tracks how your business alerts, investigates, and acts on suspicious or risky activity related to AML. A single risk case may have events related to multiple parties, indicating related investigations.

RiskCaseEvent

The RiskCaseEvent table tracks the progress of a risk case from the initial alert to the case outcome. See Lifecycle of a risk case for more information.

  • Type: MANDATORY table. AML AI requires that this table is populated.
  • Purpose for AML:
    • Creation of training labels for training, tuning, and backtesting of models
    • Understanding previously identified money laundering risk
  • Scope: includes all AML investigation-related events related to a risk case and party for which the AML_PROCESS_START (start of investigation) occurs within the core time window, plus the 12-month lookback window
  • Primary key columns: risk_case_event_id
ColumnTypeModeDescription
risk_case_event_idSTRINGREQUIREDMANDATORY: Unique ID for this event.
event_timeTIMESTAMPREQUIREDMANDATORY: Timestamp when this event occurs.
typeSTRINGREQUIREDMANDATORY: Type of risk case event. Only AML events are supported. The minimum requirement for the AML product to work is to provide a single AML_PROCESS_START, AML_PROCESS_END and where applicable AML_EXIT event per Party and risk_case_id. It is strongly recommended to also include AML_SUSPICIOUS_ACTIVITY_START and AML_SUSPICIOUS_ACTIVITY_END events for best model performance. One of: [AML_SUSPICIOUS_ACTIVITY_START:AML_SUSPICIOUS_ACTIVITY_END:AML_PROCESS_START:AML_PROCESS_END:AML_ALERT_GOOGLE:AML_ALERT_LEGACY:AML_ALERT_ADHOC:AML_ALERT_EXPLORATORY:AML_SAR:AML_EXTERNAL:AML_EXIT].
party_idSTRINGREQUIREDMANDATORY: The ID in the Party table of the Party that is the subject of this investigation process.
risk_case_idSTRINGREQUIREDMANDATORY: The ID of the overall case to which this event belongs.

Supplementary data tables

You can add supplementary data to the AML models when the data is relevant to predicting risk of money laundering. This data goes beyond what is covered in the core banking and risk investigation sections. For example, you may identify and add a risk indicator that helps models better predict a risk typology that is not otherwise covered.

PartySupplementaryData

The PartySupplementaryData table lets you add supplementary data for parties in the Party table.

  • Type: RECOMMENDED table. AML AI does not require that this table is populated.
  • Purpose for AML: enhances model performance with additional, relevant Party data
  • Scope:
    • This table can be empty. You should start using AML AI without providing any supplementary data.
    • For any supplementary data that is provided, you should include a full history for that data covering the core time window; no lookback window is required.
    • At most, 100 different party_supplementary_data_id values may be used (for example, at most 100 different values per customer at any point in time).
  • Primary key columns: party_supplementary_data_id, party_id, validity_start_time
ColumnTypeModeDescription
party_supplementary_data_idSTRINGREQUIREDMANDATORY: Identifier for the supplementary data field. This ID should match between Datasets used for the same modeling process (for example, for tuning, training, predict and backtesting steps). IDs may use letters, numbers and underscores, e.g. [a-zA-Z0-9_]. The first character should be a letter or number [a-zA-Z0-9]. With engine versions released prior to 2024-03-01 these IDs must use values from 1 to 100.
validity_start_timeTIMESTAMPREQUIREDMANDATORY: Timestamp when this record became valid.
is_entity_deletedBOOLNULLABLERECOMMENDED: If set to TRUE, indicates this supplementary data field no longer has a value for this Party.
source_systemSTRINGNULLABLERECOMMENDED: Name of the system this row was fetched from.
party_idSTRINGREQUIREDMANDATORY: ID of the Party to which this value of this supplementary data field refers.
supplementary_data_payloadSTRUCTREQUIREDMANDATORY: Payload of the supplementary data; only supports float64 values.
supplementary_data_payload.valueFLOAT64REQUIREDMANDATORY: The value of this supplementary data field. Type: float64.

Party registration tables

To import registered parties, first prepare a BigQuery table for the line of business you want to register parties for. For more information, see Register your customers. Use one of the following schemas:

  • Retail parties schema

    ColumnTypeDescription
    party_idSTRINGUnique identifier of the party in the instance's datasets
    party_sizeSTRINGNULL; content is ignored for retail party registrations

  • Commercial parties schema

    ColumnTypeDescription
    party_idSTRINGUnique identifier of the party in the instance's datasets
    party_sizeSTRING Requested party size. The tier is based on the average number of monthly transactions for the party over the preceding 365 days:
    • SMALL for small commercial parties with less than 500 average monthly transactions
    • LARGE for large commercial parties with greater than or equal to 500 average monthly transactions

    All values are case sensitive.