Learning Effective Loss Functions

This experiment lets the customer learn a problem-specific loss function that, when minimized over a training dataset, produces a model with good performance on held-out test data.


Intended use

Problem types: Most machine learning models are obtained by minimizing some loss function over a training dataset, but models are ultimately judged by their performance on held-out test data, often using metrics that are only loosely related to the loss function minimized during training. Often, the class of potentially useful loss functions can be written as a linear function of a set of hyperparameters (see the examples below). When this is the case, this experiment provides an efficient way to find good values for the loss function hyperparameters.

Our experiment optimizes the hyperparameters of linear loss functions, which are loss functions that can be written in the form:

    loss_λ(θ) = λ · feature_vector(θ)

where feature_vector(θ) is a user-defined feature vector of length k, and λ is a hyperparameter vector of length k.
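
For intuition, a linear loss is simply a dot product between λ and the feature vector. The following minimal Python sketch (ours, not part of the experiment's API) makes this concrete:

    import numpy as np

    def linear_loss(lam, feature_vector):
        """Linear loss: the dot product of the hyperparameter vector lambda
        with the model's feature vector (both of length k)."""
        lam = np.asarray(lam, dtype=float)
        feature_vector = np.asarray(feature_vector, dtype=float)
        assert lam.shape == feature_vector.shape
        return float(lam @ feature_vector)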

As an example, suppose we are doing logistic regression with L1 and L2 regularization. The loss function is of the form:

    loss(θ) = λ₁ · ‖θ‖₁ + λ₂ · ‖θ‖₂² + log_loss(θ)

This can be written as a linear loss function, with

    feature_vector(θ) = ( ‖θ‖₁, ‖θ‖₂², log_loss(θ) )  and  λ = ( λ₁, λ₂, 1 )

The experiment provides an efficient way to tune the hyperparameters, λ₁ and λ₂. (Formally, we also have a third hyperparameter, λ₃, whose value is constrained to be 1.)
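
To make the mapping concrete, here is one way this example might look in Python (the helper names are ours, not part of the experiment; the log loss shown is the standard binary logistic loss):

    import numpy as np

    def log_loss(theta, X, y):
        """Average binary logistic log loss of parameters theta on data (X, y)
        with labels y in {0, 1}, computed in a numerically stable way."""
        logits = X @ theta
        return float(np.mean(np.logaddexp(0.0, logits) - y * logits))

    def feature_vector(theta, X, y):
        """Feature vector (||theta||_1, ||theta||_2^2, log_loss(theta)); the
        full training loss is then lam @ feature_vector with lam = (λ1, λ2, 1)."""
        return np.array([np.abs(theta).sum(),
                         np.square(theta).sum(),
                         log_loss(theta, X, y)])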

As a second example, suppose we are training an image classifier using data augmentation. Each possible data augmentation operation perturbs a training image in some way (for example by flipping it horizontally, or converting it to grayscale). When training on a particular image, we first perturb it using a randomly-selected data augmentation operation (which may be a no-op), where operation i is selected with probability pᵢ. The expected loss is therefore of the form:

    expected_loss(θ) = Σᵢ pᵢ · log_lossᵢ(θ)

where log_lossᵢ(θ) is the log loss of θ on images perturbed using operation i. This experiment provides an efficient way to find good values for the probabilities pᵢ. For example, we might learn that the best validation loss is obtained if we horizontally flip an image 20% of the time, convert it to grayscale 10% of the time, and train on the unperturbed image the remaining 70% of the time.

Note that in this example we want to learn probabilities pᵢ that sum to 1, whereas the experiment provides arbitrary non-negative multipliers λᵢ. However, we can convert the output of the experiment to the desired form by rescaling (setting pᵢ = λᵢ/Z, where Z = Σᵢ λᵢ).
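
The rescaling step is a one-liner; the snippet below (illustrative, using numpy) also shows how the resulting probabilities could drive the per-image choice of augmentation:

    import numpy as np

    def to_probabilities(lam):
        """Rescale non-negative multipliers lambda_i into probabilities
        p_i = lambda_i / Z, where Z = sum_i lambda_i."""
        lam = np.asarray(lam, dtype=float)
        assert (lam >= 0).all() and lam.sum() > 0
        return lam / lam.sum()

    # Illustrative multipliers for (no-op, horizontal flip, grayscale):
    probs = to_probabilities([3.5, 1.0, 0.5])   # -> [0.7, 0.2, 0.1]
    rng = np.random.default_rng(0)
    op = rng.choice(len(probs), p=probs)        # sample one operation per training image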

Inputs and outputs:

  • Users provide: data points describing the performance of a model on training and validation datasets. Each data point includes (1) the validation error of some model θ and (2) the feature vector for that model θ (see the illustrative sketch after this list).
  • Users receive: recommended loss functions which, when minimized over the training dataset, lead to better performance on the validation set (according to a user-defined performance metric).
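
Concretely, a single data point might be represented as follows (the field names are illustrative, not the experiment's actual schema):

    # Field names are illustrative, not the experiment's actual schema.
    data_point = {
        "validation_error": 0.137,            # user-defined metric on held-out data
        "feature_vector": [4.2, 9.8, 0.31],   # length k, e.g. (||θ||_1, ||θ||_2^2, log_loss)
    }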

Industries and functions: Use cases are not constrained to specific industries or functions. This experiment may be helpful for any problem where the choice of loss function has a significant impact on the quality of the trained model.

Technical challenges: This experiment will be most helpful in cases where:

  • Users can define a set of linear loss functions with n hyperparameters, such that optimizing these n hyperparameters has the potential to significantly improve validation loss (relative to some baseline).
  • The number of hyperparameters, n, is large enough that simpler methods such as grid search or random search are unlikely to be effective. We recommend n > 3.
  • Users can train enough models to allow good values for the hyperparameters to be found. For linear loss functions with n hyperparameters, we recommend training at least n+1 models (where each model is trained using a different set of hyperparameter values).

What data do I need?

Data and label types: This experiment requires as input a set of data points, each of which gives the validation loss and feature vector for a particular model.

Specifications:

  • The feature vector must have the same length, k, for all data points.
  • k should be less than 1000.
  • For best results, there should be at least k+1 data points. However, not all data points need to be provided up front. Rather, some data points may describe the performance of models that were obtained using a loss function recommended by the experiment (see user journey; a rough sketch of this loop appears below).
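
To illustrate that iterative flow, the loop below sketches one plausible user journey in Python. Everything here is a dummy stand-in: train_model, features, and evaluate represent user code, and recommend_lambda is a placeholder for the experiment's actual recommendation step.

    import numpy as np

    rng = np.random.default_rng(0)

    def train_model(lam):
        """Dummy stand-in for user training code, which would minimize
        lam @ feature_vector(theta) over the training set."""
        return rng.normal(size=20) / (1.0 + lam[:2].sum())

    def features(theta):
        """Feature vector of a trained model; the last entry is a dummy
        stand-in for log_loss(theta)."""
        return np.array([np.abs(theta).sum(), np.square(theta).sum(),
                         rng.uniform(0.2, 0.8)])

    def evaluate(theta):
        """Dummy stand-in for the user-defined validation metric."""
        return float(np.square(theta).mean())

    def recommend_lambda(points):
        """Placeholder for the experiment's recommendation step (here it just
        draws random multipliers; the real step uses the data points)."""
        return np.append(rng.uniform(0.0, 1.0, size=2), 1.0)

    lam = np.array([0.1, 0.1, 1.0])   # hand-tuned starting point; last entry fixed at 1
    data_points = []
    for _ in range(5):                # each round trains one model
        theta = train_model(lam)
        data_points.append({"validation_error": evaluate(theta),
                            "feature_vector": features(theta)})
        lam = recommend_lambda(data_points)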

What skills do I need?

As with all AI Workshop experiments, successful users are likely to be savvy with core AI concepts and skills in order to both deploy the experiment technology and interact with our AI researchers and engineers.

In particular, users of this experiment should:

  • Be familiar with accessing Google APIs
  • Be familiar with the high-level problem of tuning loss function hyperparameters.
  • Be able to define a family of linear loss functions, and be able to train a model using a particular linear loss function (i.e., a particular vector of hyperparameters).
  • Be able to compute the validation performance of a trained model, according to some user-defined metric of interest.