Create a training script

To create a custom model, you need a Python training script that creates and trains the custom model. You initialize your training job with the Python training script, then invoke the training job's run method to run the script.

In this topic, you create the training script, then specify command arguments for your training script.

Create a training script

In this section, you create a training script. This script is a new file in your notebook environment named task.py. Later in this tutorial, you pass this script to the aiplatform.CustomTrainingJob constructor. When the script runs, it does the following:

  • Loads the data in the BigQuery dataset you created.

  • Uses the TensorFlow Keras API to build, compile, and train your model.

  • Specifies the number of epochs and the batch size to use when the Keras Model.fit method is invoked.

  • Specifies where to save model artifacts using the AIP_MODEL_DIR environment variable. AIP_MODEL_DIR is set by Vertex AI and contains the URI of a directory for saving model artifacts. For more information, see Environment variables for special Cloud Storage directories.

  • Exports a TensorFlow SavedModel to the model directory. For more information, see Using the SavedModel format on the TensorFlow website.

To create your training script, run the following code in your notebook:

%%writefile task.py

import argparse
import numpy as np
import os

import pandas as pd
import tensorflow as tf

from google.cloud import bigquery
from google.cloud import storage

# Read environmental variables
training_data_uri = os.getenv("AIP_TRAINING_DATA_URI")
validation_data_uri = os.getenv("AIP_VALIDATION_DATA_URI")
test_data_uri = os.getenv("AIP_TEST_DATA_URI")

# Read args
parser = argparse.ArgumentParser()
parser.add_argument('--label_column', required=True, type=str)
parser.add_argument('--epochs', default=10, type=int)
parser.add_argument('--batch_size', default=10, type=int)
args = parser.parse_args()

# Set up training variables
LABEL_COLUMN = args.label_column

# See https://cloud.google.com/vertex-ai/docs/workbench/managed/executor#explicit-project-selection for issues regarding permissions.
PROJECT_NUMBER = os.environ["CLOUD_ML_PROJECT_ID"]
bq_client = bigquery.Client(project=PROJECT_NUMBER)


# Download a table
def download_table(bq_table_uri: str):
    # Remove bq:// prefix if present
    prefix = "bq://"
    if bq_table_uri.startswith(prefix):
        bq_table_uri = bq_table_uri[len(prefix) :]
        
    # Download the BigQuery table as a dataframe
    # This requires the "BigQuery Read Session User" role on the custom training service account.
    table = bq_client.get_table(bq_table_uri)
    return bq_client.list_rows(table).to_dataframe()

# Download dataset splits
df_train = download_table(training_data_uri)
df_validation = download_table(validation_data_uri)
df_test = download_table(test_data_uri)

def convert_dataframe_to_dataset(
    df_train: pd.DataFrame,
    df_validation: pd.DataFrame,
):
    df_train_x, df_train_y = df_train, df_train.pop(LABEL_COLUMN)
    df_validation_x, df_validation_y = df_validation, df_validation.pop(LABEL_COLUMN)

    y_train = tf.convert_to_tensor(np.asarray(df_train_y).astype("float32"))
    y_validation = tf.convert_to_tensor(np.asarray(df_validation_y).astype("float32"))

    # Convert to numpy representation
    x_train = tf.convert_to_tensor(np.asarray(df_train_x).astype("float32"))
    x_test = tf.convert_to_tensor(np.asarray(df_validation_x).astype("float32"))

    # Convert to one-hot representation
    num_species = len(df_train_y.unique())
    y_train = tf.keras.utils.to_categorical(y_train, num_classes=num_species)
    y_validation = tf.keras.utils.to_categorical(y_validation, num_classes=num_species)

    dataset_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    dataset_validation = tf.data.Dataset.from_tensor_slices((x_test, y_validation))
    return (dataset_train, dataset_validation)

# Create datasets
dataset_train, dataset_validation = convert_dataframe_to_dataset(df_train, df_validation)

# Shuffle train set
dataset_train = dataset_train.shuffle(len(df_train))

def create_model(num_features):
    # Create model
    Dense = tf.keras.layers.Dense
    model = tf.keras.Sequential(
        [
            Dense(
                100,
                activation=tf.nn.relu,
                kernel_initializer="uniform",
                input_dim=num_features,
            ),
            Dense(75, activation=tf.nn.relu),
            Dense(50, activation=tf.nn.relu),            
            Dense(25, activation=tf.nn.relu),
            Dense(3, activation=tf.nn.softmax),
        ]
    )
    
    # Compile Keras model
    optimizer = tf.keras.optimizers.RMSprop(lr=0.001)
    model.compile(
        loss="categorical_crossentropy", metrics=["accuracy"], optimizer=optimizer
    )
    
    return model

# Create the model
model = create_model(num_features=dataset_train._flat_shapes[0].dims[0].value)

# Set up datasets
dataset_train = dataset_train.batch(args.batch_size)
dataset_validation = dataset_validation.batch(args.batch_size)

# Train the model
model.fit(dataset_train, epochs=args.epochs, validation_data=dataset_validation)

tf.saved_model.save(model, os.getenv("AIP_MODEL_DIR"))

After you create the script, it appears in the root folder of your notebook:

View training script.

Define arguments for your training script

You pass the following command-line arguments to your training script:

  • label_column - This identifies the column in your data that contains what you want to predict. In this case, that column is species. You defined this in a variable named LABEL_COLUMN when you processed your data. For more information, see Download, preprocess, and split the data.

  • epochs - This is the number of epochs used when you train your model. An epoch is an iteration over the data when training your model. This tutorial uses 20 epochs.

  • batch_size - This is the number of samples that are processed before your model updates. This tutorial uses a batch size of 10.

To define the arguments that are passed to your script, run the following code:

JOB_NAME = "custom_job_unique"

EPOCHS = 20
BATCH_SIZE = 10

CMDARGS = [
    "--label_column=" + LABEL_COLUMN,
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
]