Create and manage datasets

This page shows you how to create and manage AML AI datasets. A dataset is used as an input for the training, prediction, and backtesting pipelines. A dataset contains references to BigQuery tables in a Google Cloud project.

You only need to create the dataset at this point. The other dataset methods are provided as a convenience.

Before you begin

  • To get the permissions that you need to create and manage datasets, ask your administrator to grant you the Financial Services Admin (financialservices.admin) IAM role on your project. For more information about granting roles, see Manage access.

    You might also be able to get the required permissions through custom roles or other predefined roles.

  • Create an instance

Create a dataset

Some API methods return a long-running operation (LRO). These methods are asynchronous. The operation might not be completed when the method returns a response. For these methods, send the request and then check for the result.

Send the request

To create a dataset, use the projects.locations.instances.datasets.create method.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: your Google Cloud project ID listed in the IAM Settings
  • LOCATION: the location of the instance; use one of the supported regions:
    • us-central1
    • us-east1
    • asia-south1
    • europe-west1
    • europe-west2
    • europe-west4
    • northamerica-northeast1
    • southamerica-east1
  • INSTANCE_ID: the user-defined identifier for the instance
  • DATASET_ID: a user-defined identifier for the AML AI dataset; use only lowercase letters, numbers, dashes, and underscores (for example, train_jan2018_apr2020)
  • BQ_INPUT_DATASET_NAME: the BigQuery input dataset name
  • PARTY_TABLE: the Party table in the BigQuery input dataset
  • ACCOUNT_PARTY_LINK_TABLE: the AccountPartyLink table in the BigQuery input dataset
  • TRANSACTION_TABLE: the Transaction table in the BigQuery input dataset
  • RISK_CASE_EVENT_TABLE: the RiskCaseEvent table in the BigQuery input dataset
  • PARTY_SUPPLEMENTARY_DATA: the PartySupplementaryData table in the BigQuery input dataset; this table is optional and can be removed from the request JSON
  • DATA_START_DATE: the start date and time of the data to use in the dataset; use RFC3339 UTC "Zulu" format (for example, 2014-10-02T15:01:23Z)
  • DATA_END_DATE: the end date and time of the data to use in the dataset; use RFC3339 UTC "Zulu" format (for example, 2014-10-02T15:01:23Z)

Request JSON body:

{
  "tableSpecs": {
    "party": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_TABLE",
    "account_party_link": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.ACCOUNT_PARTY_LINK_TABLE",
    "transaction": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.TRANSACTION_TABLE",
    "risk_case_event": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.RISK_CASE_EVENT_TABLE",
    "party_supplementary_data": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_SUPPLEMENTARY_DATA"
  },
  "dateRange": {
    "startTime": "DATA_START_DATE",
    "endTime": "DATA_END_DATE"
  },
  "timeZone": {
    "id": "UTC"
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

cat > request.json << 'EOF'
{
  "tableSpecs": {
    "party": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_TABLE",
    "account_party_link": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.ACCOUNT_PARTY_LINK_TABLE",
    "transaction": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.TRANSACTION_TABLE",
    "risk_case_event": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.RISK_CASE_EVENT_TABLE",
    "party_supplementary_data": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_SUPPLEMENTARY_DATA"
  },
  "dateRange": {
    "startTime": "DATA_START_DATE",
    "endTime": "DATA_END_DATE"
  },
  "timeZone": {
    "id": "UTC"
  }
}
EOF

Then execute the following command to send your REST request:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets?dataset_id=DATASET_ID"

PowerShell

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
  "tableSpecs": {
    "party": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_TABLE",
    "account_party_link": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.ACCOUNT_PARTY_LINK_TABLE",
    "transaction": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.TRANSACTION_TABLE",
    "risk_case_event": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.RISK_CASE_EVENT_TABLE",
    "party_supplementary_data": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_SUPPLEMENTARY_DATA"
  },
  "dateRange": {
    "startTime": "DATA_START_DATE",
    "endTime": "DATA_END_DATE"
  },
  "timeZone": {
    "id": "UTC"
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets?dataset_id=DATASET_ID" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.financialservices.v1.OperationMetadata",
    "createTime": CREATE_TIME,
    "target": "projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID",
    "verb": "create",
    "requestedCancellation": false,
    "apiVersion": "v1"
  },
  "done": false
}

Check for the result

Use the projects.locations.operations.get method to check if the dataset has been created. If the response contains "done": false, repeat the command until the response contains "done": true. These operations can take a few minutes to several hours to complete.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: your Google Cloud project ID listed in the IAM Settings
  • LOCATION: the location of the instance; use one of the supported regions:
    • us-central1
    • us-east1
    • asia-south1
    • europe-west1
    • europe-west2
    • europe-west4
    • northamerica-northeast1
    • southamerica-east1
  • OPERATION_ID: the identifier for the operation

To send your request, choose one of these options:

curl

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID"

PowerShell

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.financialservices.v1.OperationMetadata",
    "createTime": CREATE_TIME,
    "endTime": END_TIME,
    "target": "projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID",
    "verb": "create",
    "requestedCancellation": false,
    "apiVersion": "v1"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.financialservices.v1.Dataset",
    "name": "projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID",
    "createTime": CREATE_TIME,
    "updateTime": UPDATE_TIME,
    "tableSpecs": {
      "party": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_TABLE",
      "account_party_link": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.ACCOUNT_PARTY_LINK_TABLE",
      "transaction": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.TRANSACTION_TABLE",
      "risk_case_event": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.RISK_CASE_EVENT_TABLE",
      "party_supplementary_data": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_SUPPLEMENTARY_DATA"
    },
    "state": "ACTIVE",
    "dateRange": {
      "start_time": "DATA_START_DATE",
      "end_time": "DATA_END_DATE"
    },
    "timeZone": {
      "id": "UTC"
    }
  }
}

Optional methods

The following dataset methods are provided as a convenience.

Get a dataset

To get a dataset, use the projects.locations.instances.datasets.get method.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: your Google Cloud project ID listed in the IAM Settings
  • LOCATION: the location of the instance; use one of the supported regions:
    • us-central1
    • us-east1
    • asia-south1
    • europe-west1
    • europe-west2
    • europe-west4
    • northamerica-northeast1
    • southamerica-east1
  • INSTANCE_ID: the user-defined identifier for the instance
  • DATASET_ID: the user-defined identifier for the dataset

To send your request, choose one of these options:

curl

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID"

PowerShell

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID",
  "createTime": CREATE_TIME,
  "updateTime": UPDATE_TIME,
  "tableSpecs": {
    "party": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_TABLE",
    "account_party_link": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.ACCOUNT_PARTY_LINK_TABLE",
    "transaction": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.TRANSACTION_TABLE",
    "risk_case_event": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.RISK_CASE_EVENT_TABLE",
    "party_supplementary_data": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_SUPPLEMENTARY_DATA"
  },
  "state": "ACTIVE",
  "dateRange": {
    "start_time": "DATA_START_DATE",
    "end_time": "DATA_END_DATE"
  },
  "timeZone": {
    "id": "UTC"
  }
}

Update a dataset

To update a dataset, use the projects.locations.instances.datasets.patch method.

Not all fields in a dataset can be updated. The following example updates the key-value pair user labels associated with the dataset.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: your Google Cloud project ID listed in the IAM Settings
  • LOCATION: the location of the instance; use one of the supported regions:
    • us-central1
    • us-east1
    • asia-south1
    • europe-west1
    • europe-west2
    • europe-west4
    • northamerica-northeast1
    • southamerica-east1
  • INSTANCE_ID: a user-defined identifier for the instance
  • DATASET_ID: the user-defined identifier for the dataset
  • KEY: The key in a key-value pair used to organize datasets. See labels for more information.
  • VALUE: The value in a key-value pair used to organize datasets. See labels for more information.

Request JSON body:

{
  "labels": {
    "KEY": "VALUE"
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

cat > request.json << 'EOF'
{
  "labels": {
    "KEY": "VALUE"
  }
}
EOF

Then execute the following command to send your REST request:

curl -X PATCH \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID?updateMask=labels"

PowerShell

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
  "labels": {
    "KEY": "VALUE"
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method PATCH `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID?updateMask=labels" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.financialservices.v1.OperationMetadata",
    "createTime": CREATE_TIME,
    "target": "projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID",
    "verb": "update",
    "requestedCancellation": false,
    "apiVersion": "v1"
  },
  "done": false
}

For more information on how to get the result of the long-running operation (LRO), see Check for the result.

List the datasets

To list the datasets for a given instance, use the projects.locations.instances.datasets.list method.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: your Google Cloud project ID listed in the IAM Settings
  • LOCATION: the location of the instance; use one of the supported regions:
    • us-central1
    • us-east1
    • asia-south1
    • europe-west1
    • europe-west2
    • europe-west4
    • northamerica-northeast1
    • southamerica-east1
  • INSTANCE_ID: the user-defined identifier for the instance

To send your request, choose one of these options:

curl

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets"

PowerShell

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "datasets": [
    {
      "name": "projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID",
      "createTime": CREATE_TIME,
      "updateTime": UPDATE_TIME,
      "tableSpecs": {
        "party": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_TABLE",
        "account_party_link": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.ACCOUNT_PARTY_LINK_TABLE",
        "transaction": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.TRANSACTION_TABLE",
        "risk_case_event": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.RISK_CASE_EVENT_TABLE",
        "party_supplementary_data": "bq://PROJECT_ID.BQ_INPUT_DATASET_NAME.PARTY_SUPPLEMENTARY_DATA"
      },
      "state": "ACTIVE",
      "dateRange": {
        "start_time": "DATA_START_DATE",
        "end_time": "DATA_END_DATE"
      },
      "timeZone": {
        "id": "UTC"
      }
    }
  ]
}

Delete a dataset

To delete a dataset, use the projects.locations.instances.datasets.delete method.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: your Google Cloud project ID listed in the IAM Settings
  • LOCATION: the location of the instance; use one of the supported regions:
    • us-central1
    • us-east1
    • asia-south1
    • europe-west1
    • europe-west2
    • europe-west4
    • northamerica-northeast1
    • southamerica-east1
  • INSTANCE_ID: the user-defined identifier for the instance
  • DATASET_ID: the user-defined identifier for the dataset

To send your request, choose one of these options:

curl

Execute the following command:

curl -X DELETE \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID"

PowerShell

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method DELETE `
-Headers $headers `
-Uri "https://financialservices.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.financialservices.v1.OperationMetadata",
    "createTime": CREATE_TIME,
    "target": "projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_ID/datasets/DATASET_ID",
    "verb": "delete",
    "requestedCancellation": false,
    "apiVersion": "v1"
  },
  "done": false
}

For more information on how to get the result of the long-running operation (LRO), see Check for the result.