Train a forecast model

This page shows you how to train an AutoML forecast model from a tabular dataset using either the Google Cloud console or the Vertex AI API.

Before you begin

Before you can train a forecast model, you must create a dataset and populate it with your training data.

Train a model

Google Cloud console

  1. In the Google Cloud console, in the Vertex AI section, go to the Datasets page.


  2. Click the name of the dataset you want to use to train your model to open its details page.

  3. If your data type uses annotation sets, select the annotation set you want to use for this model.

  4. Click Train new model.

  5. On the Train new model page, configure the training as follows:

    1. Select the model training method.

      • AutoML is a good choice for a wide range of use cases.
      Click Continue.

    2. Enter the display name for your new model.

    3. Select your target column.

      The target column is the value that the model will forecast. Learn more about target column requirements.

    4. If you did not set your Series identifier and Timestamp columns on your dataset, select them now.

    5. Select your Data granularity.

    6. Enter your Context window and Forecast horizon.

      If you do not specify a Context window, it defaults to the value set for Forecast horizon. For more information, see Considerations for setting the context window and forecast horizon.

    7. If you would like to export your test dataset to BigQuery, check Export test dataset to BigQuery and provide the name of the table.

    8. If you want Vertex AI to continue training even if your data has validation errors, you can select Ignore validation.

      Unless you understand the source of the data errors and their impact on model quality, let Vertex AI cancel training when it finds validation errors.

    9. If you want to manually control your data split, open the Advanced options.

    10. The default data split is chronological, with the standard 80/10/10 percentages. If you would like to manually specify which rows are assigned to which split, select Manual and specify your Data split column.

      Learn more about data splits.

    11. Click Continue.

    12. If you haven't already, click Generate statistics.

      Generating statistics populates the Transformation dropdown menus.

    13. On the Training options page, review your column list and exclude from training any columns that should not be used to train the model.

      If you are using a data split column, it should be included.

    14. Review the transformations selected for your included features and make any required updates.

      Rows containing data that is invalid for the selected transformation are excluded from training. Learn more about transformations.

    15. For each column you included for training, specify the Feature type for how that feature relates to its time series, and whether it is available at forecast time. Learn more about feature type and availability.

    16. If you want to specify a weight column, or change your optimization objective from the default, open the Advanced options and make your selections.

      Learn more about weight columns and optimization objectives.

    17. Click Continue.

    18. In the Compute and pricing window, enter the maximum number of hours you want your model to train for.

      This setting helps you cap training costs. The actual elapsed time can be longer than this value because creating a new model involves other operations.

      The suggested training time depends on the size of your forecast horizon and of your training data. The table below lists some sample forecasting training runs and the range of training time that was needed to train a high-quality model.

      Rows         Features   Forecast horizon   Training time
      12 million   10         6                  3-6 hours
      20 million   50         13                 6-12 hours
      16 million   30         365                24-48 hours
      For information about training pricing, see the pricing page.

    19. Click Start Training.

      Model training can take many hours, depending on the size and complexity of your data and your training budget. You can close this tab and return to it later. You will receive an email when your model has completed training.

API

Select a tab for your language or environment:

REST & CMD LINE

You use the trainingPipelines.create command to train a model.

Train the model.

Before using any of the request data, make the following replacements:

  • LOCATION: Your region.
  • PROJECT: Your project ID.
  • TRAINING_PIPELINE_DISPLAY_NAME: Display name for the training pipeline created for this operation.
  • TRAINING_TASK_DEFINITION: The model training method:
    • AutoML: A good choice for a wide range of use cases.
      gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_forecasting_1.0.0.yaml
    • Seq2Seq+: A good choice for experimentation. The algorithm is likely to converge faster than AutoML because its architecture is simpler and it uses a smaller search space. Our experiments find that Seq2Seq+ performs well with a small time budget and on datasets smaller than 1 GB in size.
      gs://google-cloud-aiplatform/schema/trainingjob/definition/seq2seq_plus_time_series_forecasting_1.0.0.yaml
  • TARGET_COLUMN: The column (value) you want this model to predict.
  • TIME_COLUMN: The time column. Learn more.
  • TIME_SERIES_IDENTIFIER_COLUMN: The time series identifier column. Learn more.
  • WEIGHT_COLUMN: (Optional) The weight column. Learn more.
  • TRAINING_BUDGET: The maximum amount of time you want the model to train, in milli node hours (1,000 milli node hours equal one node hour).
  • GRANULARITY_UNIT: The unit to use for the granularity of your training data and your forecast horizon and context window. Can be minute, hour, day, week, month, or year. Select day if you would like to use holiday effect modeling. Learn more.
  • GRANULARITY_QUANTITY: The number of granularity units that make up the interval between observations in your training data. Must be 1 for all units except minutes, which can be 1, 5, 10, 15, or 30. Learn more.
  • GROUP_COLUMNS: Column names in your training input table that identify the grouping for the hierarchy level. The columns must also be listed in `time_series_attribute_columns`. Learn more.
  • GROUP_TOTAL_WEIGHT: Weight of the group-aggregated loss relative to the individual loss. Disabled if set to `0.0` or not set. If the group column is not set, all time series are treated as part of the same group, and the loss is aggregated over all time series. Learn more.
  • TEMPORAL_TOTAL_WEIGHT: Weight of the time-aggregated loss relative to the individual loss. Disabled if set to `0.0` or not set. Learn more.
  • GROUP_TEMPORAL_TOTAL_WEIGHT: Weight of the total (group x time) aggregated loss relative to the individual loss. Disabled if set to `0.0` or not set. If the group column is not set, all time series are treated as part of the same group, and the loss is aggregated over all time series. Learn more.
  • HOLIDAY_REGIONS: (Optional) One or more geographical regions based on which the holiday effect is applied in modeling. During training, Vertex AI creates holiday categorical features within the model based on the date from the time column and the specified geographical regions. To enable it, set GRANULARITY_UNIT to day and specify one or more regions in the HOLIDAY_REGIONS field. By default, holiday effect modeling is disabled.

    Acceptable values include the following:

    • GLOBAL: Detects holidays for all world regions.
    • NA: Detects holidays for North America.
    • JAPAC: Detects holidays for Japan and Asia Pacific.
    • EMEA: Detects holidays for Europe, the Middle East, and Africa.
    • LAC: Detects holidays for Latin America and the Caribbean.
    • ISO 3166-1 Country codes: Detects holidays for individual countries.
  • FORECAST_HORIZON: The size of the forecast horizon, specified in granularity units. The forecast horizon is the period of time the model should forecast results for. Learn more.
  • CONTEXT_WINDOW: The number of granularity units the model should look backward to include at training time. Learn more.
  • OPTIMIZATION_OBJECTIVE: Required only if you do not want the default optimization objective for your prediction type. Learn more.
  • TIME_SERIES_ATTRIBUTE_COL: The name or names of the columns that are time series attributes. Learn more.
  • AVAILABLE_AT_FORECAST_COL: The name or names of the covariate columns whose value is known at forecast time. Learn more.
  • UNAVAILABLE_AT_FORECAST_COL: The name or names of the covariate columns whose value is unknown at forecast time. Learn more.
  • TRANSFORMATION_TYPE: The transformation type is provided for each column used to train the model. Learn more.
  • COLUMN_NAME: The name of the column with the specified transformation type. Every column used to train the model must be specified.
  • MODEL_DISPLAY_NAME: Display name for the newly trained model.
  • DATASET_ID: ID for the training Dataset.
  • You can provide a Split object to control your data split. For information about controlling data split, see Control the data split using REST.
  • You can provide a windowConfig object to configure a forecasting window. For further information, see Configure the forecasting window using REST.
  • PROJECT_NUMBER: Project number for your project.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/trainingPipelines

Request JSON body:

{
    "displayName": "TRAINING_PIPELINE_DISPLAY_NAME",
    "trainingTaskDefinition": "TRAINING_TASK_DEFINITION",
    "trainingTaskInputs": {
        "targetColumn": "TARGET_COLUMN",
        "timeColumn": "TIME_COLUMN",
        "timeSeriesIdentifierColumn": "TIME_SERIES_IDENTIFIER_COLUMN",
        "weightColumn": "WEIGHT_COLUMN",
        "trainBudgetMilliNodeHours": TRAINING_BUDGET,
        "dataGranularity": {"unit": "GRANULARITY_UNIT", "quantity": GRANULARITY_QUANTITY},
        "hierarchyConfig": {"groupColumns": GROUP_COLUMNS, "groupTotalWeight": GROUP_TOTAL_WEIGHT, "temporalTotalWeight": TEMPORAL_TOTAL_WEIGHT, "groupTemporalTotalWeight": GROUP_TEMPORAL_TOTAL_WEIGHT},
        "holidayRegions": ["HOLIDAY_REGIONS_1", "HOLIDAY_REGIONS_2", ...],
        "forecast_horizon": FORECAST_HORIZON,
        "context_window": CONTEXT_WINDOW,
        "optimizationObjective": "OPTIMIZATION_OBJECTIVE",
        "time_series_attribute_columns": ["TIME_SERIES_ATTRIBUTE_COL_1", "TIME_SERIES_ATTRIBUTE_COL_2", ...],
        "available_at_forecast_columns": ["AVAILABLE_AT_FORECAST_COL_1", "AVAILABLE_AT_FORECAST_COL_2", ...],
        "unavailable_at_forecast_columns": ["UNAVAILABLE_AT_FORECAST_COL_1", "UNAVAILABLE_AT_FORECAST_COL_2", ...],
        "transformations": [
            {"TRANSFORMATION_TYPE_1": {"column_name": "COLUMN_NAME_1"}},
            {"TRANSFORMATION_TYPE_2": {"column_name": "COLUMN_NAME_2"}},
            ...
        ]
    },
    "modelToUpload": {"displayName": "MODEL_DISPLAY_NAME"},
    "inputDataConfig": {
      "datasetId": "DATASET_ID"
    }
}

To send your request, use curl or any other HTTP client that can send authenticated requests.
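For example, the following is a minimal curl sketch. It assumes the request body above is saved in a file named request.json and that the gcloud CLI is installed to supply an access token:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/trainingPipelines"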

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/trainingPipelines/TRAINING_PIPELINE_ID",
  "displayName": "myModelName",
  "trainingTaskDefinition": "gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_tabular_1.0.0.yaml",
  "modelToUpload": {
    "displayName": "myModelName"
  },
  "state": "PIPELINE_STATE_PENDING",
  "createTime": "2020-08-18T01:22:57.479336Z",
  "updateTime": "2020-08-18T01:22:57.479336Z"
}
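
The training pipeline runs asynchronously. To check its progress, you can poll the trainingPipelines.get method. The following curl sketch assumes TRAINING_PIPELINE_ID is the ID from the name field in the response above; when training finishes, the state field changes to PIPELINE_STATE_SUCCEEDED:

curl -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/trainingPipelines/TRAINING_PIPELINE_ID"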

Python

To learn how to install and use the client library for Vertex AI, see Vertex AI client libraries. For more information, see the Vertex AI Python API reference documentation.
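If the client library isn't installed yet, you can typically install it with pip:

pip install google-cloud-aiplatform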

from typing import List

from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value


def create_training_pipeline_tabular_forecasting_sample(
    project: str,
    display_name: str,
    dataset_id: str,
    model_display_name: str,
    target_column: str,
    time_series_identifier_column: str,
    time_column: str,
    time_series_attribute_columns: List[str],
    unavailable_at_forecast: List[str],
    available_at_forecast: List[str],
    forecast_horizon: int,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.PipelineServiceClient(client_options=client_options)
    # set the columns used for training and their data types
    transformations = [
        {"auto": {"column_name": "date"}},
        {"auto": {"column_name": "state_name"}},
        {"auto": {"column_name": "county_fips_code"}},
        {"auto": {"column_name": "confirmed_cases"}},
        {"auto": {"column_name": "deaths"}},
    ]

    data_granularity = {"unit": "day", "quantity": 1}

    # the inputs should be formatted according to the training_task_definition yaml file
    training_task_inputs_dict = {
        # required inputs
        "targetColumn": target_column,
        "timeSeriesIdentifierColumn": time_series_identifier_column,
        "timeColumn": time_column,
        "transformations": transformations,
        "dataGranularity": data_granularity,
        "optimizationObjective": "minimize-rmse",
        "trainBudgetMilliNodeHours": 8000,
        "timeSeriesAttributeColumns": time_series_attribute_columns,
        "unavailableAtForecast": unavailable_at_forecast,
        "availableAtForecast": available_at_forecast,
        "forecastHorizon": forecast_horizon,
    }

    training_task_inputs = json_format.ParseDict(training_task_inputs_dict, Value())

    training_pipeline = {
        "display_name": display_name,
        "training_task_definition": "gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_forecasting_1.0.0.yaml",
        "training_task_inputs": training_task_inputs,
        "input_data_config": {
            "dataset_id": dataset_id,
            "fraction_split": {
                "training_fraction": 0.8,
                "validation_fraction": 0.1,
                "test_fraction": 0.1,
            },
        },
        "model_to_upload": {"display_name": model_display_name},
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_training_pipeline(
        parent=parent, training_pipeline=training_pipeline
    )
    print("response:", response)

Control the data split using REST

You can control how your training data is split between the training, validation, and test sets. When you use the Vertex AI API, use the Split object to determine your data split. For forecast models, the Split object can be included in the inputDataConfig object as a PredefinedSplit:

  • PredefinedSplit:

    • DATA_SPLIT_COLUMN: The column containing the data split values (TRAIN, VALIDATION, TEST).

    Manually specify the data split for each row by using a split column. Learn more.

    "predefinedSplit": {
      "key": DATA_SPLIT_COLUMN
    },
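
    If you use the Python sample on this page instead of REST, the equivalent is to set predefined_split in input_data_config in place of fraction_split. The following is a minimal sketch; the data_split column name is a hypothetical example:

    "input_data_config": {
        "dataset_id": dataset_id,
        # Assign each row to the split named in the (hypothetical) "data_split" column.
        "predefined_split": {"key": "data_split"},
    },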
    

Configure the forecasting window using REST

A rolling window strategy lets you generate windows from the input data and specify which training data is most important to capture. Each window is a series of rows composed of:

  • The context, or historical data, up to the time of prediction.
  • The horizon, or rows used for prediction.

Taken together, the rows in the window define a time-series instance that serves as a model input: it is what Vertex AI trains on, evaluates on, and uses for prediction.

The row used to generate the window is the first row of the horizon and uniquely identifies the window in the time series.

In certain situations, you can get a better model by generating fewer than the default 100,000,000 windows. The data seen during training with a lower window count may be distributed more evenly.

Select one of the following rolling window strategies for forecast window generation:

  • maxCount:

    • MAX_COUNT_VALUE: The maximum number of windows. The default value is 100,000,000.

    Vertex AI generates a maximum of MAX_COUNT_VALUE windows from the input dataset. If the number of rows in the input dataset is less than MAX_COUNT_VALUE, every row is used to generate a window. Otherwise, Vertex AI performs random sampling to select the rows.

    To use this option, add the following to "trainingTaskInputs" of the JSON request:

    "windowConfig": {
      "maxCount": MAX_COUNT_VALUE
    },
    
  • strideLength:

    • STRIDE_LENGTH_VALUE: The number of input rows used to generate a window. The value can be between 1 and 1000 and defaults to 1.

    Vertex AI uses one out of every STRIDE_LENGTH_VALUE input rows to generate a window. This option is useful for seasonal or periodic predictions. For example, you can limit forecasting to a single day of the week by setting STRIDE_LENGTH_VALUE to 7.

    To use this option, add the following to "trainingTaskInputs" of the JSON request:

    "windowConfig": {
      "strideLength": STRIDE_LENGTH_VALUE
    },
    
  • column:

    • COLUMN_NAME: The name of a column with True / False values that specify which rows should be used for window generation.

    You can add a column to your input data where the values are either True or False. Vertex AI generates a window for every input row where the value of the column is True. The True and False values can be set in any order, as long as the total count of True rows is less than 100 million.

    To use this option, add the following to "trainingTaskInputs" of the JSON request:

    "windowConfig": {
      "column": "COLUMN_NAME"
    },
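
If you use the Python sample on this page, the equivalent is to add a windowConfig entry to training_task_inputs_dict. The following is a minimal sketch using the strideLength strategy:

    training_task_inputs_dict = {
        # ... required inputs as shown in the sample above ...
        # Generate a window from one out of every 7 input rows.
        "windowConfig": {"strideLength": 7},
    }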
    

Optimization objectives for forecast models

When you train a model, Vertex AI selects a default optimization objective based on your model type and the data type used for your target column.

The following list describes the optimization objectives available for forecast models, their API values, and when to use each one:

  • RMSE (minimize-rmse): Minimize root-mean-squared error (RMSE). Captures more extreme values accurately and is less biased when aggregating predictions. This is the default value.
  • MAE (minimize-mae): Minimize mean absolute error (MAE). Views extreme values as outliers with less impact on the model.
  • RMSLE (minimize-rmsle): Minimize root-mean-squared log error (RMSLE). Penalizes error on relative size rather than absolute value. Useful when both predicted and actual values can be quite large.
  • RMSPE (minimize-rmspe): Minimize root-mean-squared percentage error (RMSPE). Captures a large range of values accurately. Similar to RMSE, but relative to target magnitude. Useful when the range of values is large.
  • WAPE (minimize-wape-mae): Minimize the combination of weighted absolute percentage error (WAPE) and mean absolute error (MAE). Useful when the actual values are low.
  • Quantile loss (minimize-quantile-loss): Minimize the scaled pinball loss of the defined quantiles to quantify uncertainty in estimates.
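
In the API, pass the value shown in parentheses as optimizationObjective in trainingTaskInputs. For example, in the Python sample above you could replace the default shown there. The quantiles field in the commented lines is an assumption for the quantile-loss case; check the training schema for your version before relying on it:

    # A minimal sketch: optimize for MAE instead of the default RMSE.
    "optimizationObjective": "minimize-mae",

    # For quantile forecasts, you would instead set (field name assumed):
    # "optimizationObjective": "minimize-quantile-loss",
    # "quantiles": [0.1, 0.5, 0.9],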

What's next