General guidance on conducting A/B experiments

This page describes how you can use A/B experiments to understand how Vertex AI Search for commerce is impacting your business.

Overview

An A/B experiment is a randomized experiment with two groups: an experimental group and a control group. The experimental group receives a different treatment (in this case, predictions or search results from Vertex AI Search for commerce); the control group does not.

When you run an A/B experiment, you include the information about which group a user was in when you record user events. That information is used to refine the model and provide metrics.

Both versions of your application must be identical, except that users in the experimental group see results generated by Vertex AI Search for commerce and users in the control group do not. You log user events for both groups.

For more on traffic splitting, see Splitting Traffic in the App Engine documentation.

Experiment platforms

Set up the experiment using a third-party experiment platform such as VWO or AB Tasty. The control and experimental groups each get a unique experiment ID from the platform. When you record a user event, specify which group the user is in by including the experiment ID in the experimentIds field. Providing the experiment ID lets you compare the metrics for the versions of your application seen by the control and experimental groups.
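For illustration, here is a minimal sketch of recording a search event tagged with an experiment ID using the Retail API Python client. The project path, visitor ID, and experiment ID value are placeholders; adapt them to your own pipeline and to the IDs assigned by your experiment platform.

```python
# A minimal sketch, assuming the google-cloud-retail package and placeholder
# project, visitor, and experiment ID values.
from google.cloud import retail_v2


def write_search_event(visitor_id: str, query: str, experiment_id: str) -> None:
    """Record a search user event tagged with the user's experiment group."""
    client = retail_v2.UserEventServiceClient()

    user_event = retail_v2.UserEvent(
        event_type="search",
        visitor_id=visitor_id,
        search_query=query,
        # The experiment ID from your experiment platform identifies which
        # group this user belongs to. The field is repeated, so several IDs
        # can be attached to the same event.
        experiment_ids=[experiment_id],
    )

    client.write_user_event(
        parent="projects/PROJECT_ID/locations/global/catalogs/default_catalog",
        user_event=user_event,
    )
```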

Best practices for A/B experiments

The goal of an A/B experiment is to accurately determine the impact of updating your site (in this case, employing Vertex AI Search for commerce). To get an accurate measure of the impact, you must design and implement the experiment correctly, so that other differences don't creep in and impact the experiment results.

Experiment IDs are used for A/B testing, where you can compare Vertex AI Search for commerce against an existing search solution. They can also be used to run experiments on a site that has fully adopted Vertex AI Search for commerce, where, for example, a new config, control, or boost spec needs to be tested against a control group.

The experiment ID field in the user events is an array, which allows for more granular measurement.

Consider the following use cases:

  • Vertex AI Search for commerce performance needs to be compared against a control group.
  • The overall performance needs to be measured.
  • Mobile-only performance needs to be measured.
  • Desktop-only performance needs to be measured.
  • Search and recommendations performance needs to be measured separately as well.

To achieve such granular and sliced measurements, you might need a total of 10 experiment IDs, three of which are sent in the event's experiment ID array for every event: the group-level ID, the device-level ID, and the surface-level ID.

| Experiment IDs for control group events | Experiment IDs for test group events (search for commerce) | Scope of user events |
|---|---|---|
| Control | Vertex AI Search for commerce | All events |
| Control_mobile | Google_mobile | All mobile events |
| Control_desktop | Google_desktop | All desktop events |
| Control_search | Google_search | All search and related events |
| Control_recommendations | Google_recommendations | All recommendations and related events |

To measure the overall performance, compare the metrics derived from events with experiment IDs Control and Vertex AI Search for commerce. To measure the mobile search performance, compare the metrics derived from events with the experiment IDs Control_mobile + Control_search versus Google_mobile + Google_search.
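One way to assemble the experiment ID array is sketched below. The helper function and its inputs are hypothetical, but the ID values mirror the table above: each event carries the group-level, device-level, and surface-level IDs.

```python
# A sketch of building the experimentIds array for a single user event, using
# the group-, device-, and surface-level IDs from the table above. The helper
# and its inputs are hypothetical; wire them to your own assignment logic.
def build_experiment_ids(is_test_group: bool, device: str, surface: str) -> list[str]:
    if is_test_group:
        return ["Vertex AI Search for commerce", f"Google_{device}", f"Google_{surface}"]
    return ["Control", f"Control_{device}", f"Control_{surface}"]


# Example: a search event from a mobile user in the control group.
print(build_experiment_ids(is_test_group=False, device="mobile", surface="search"))
# ['Control', 'Control_mobile', 'Control_search']
```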

Category hierarchy

Make sure the same products have the same category hierarchy between the control and the test sites. Take, for example, a t-shirt product that has the category hierarchy clothing > mens > tops > tee-shirts on the control site, while the same product sits under a different hierarchy on the test site, such as mens > popular > tops. This setup results in different search results and different category facets between the control and test sites. It also affects the browse experience, because the page_category is an input to the browse call, along with filters.
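As a quick illustration, a consistency check over the two catalogs might compare the categories field for the same product on both sides. The product records below are placeholder dicts; in practice they would come from your catalog exports or the Retail API.

```python
# A sketch comparing category hierarchies for the same product between the
# control and test catalogs. The payloads are placeholders for illustration.
control_product = {
    "id": "tshirt-123",
    "categories": ["clothing > mens > tops > tee-shirts"],
}
test_product = {
    "id": "tshirt-123",
    "categories": ["mens > popular > tops"],  # mismatched hierarchy
}

if control_product["categories"] != test_product["categories"]:
    # A mismatch like this changes search results, category facets, and the
    # page_category input used for browse calls between the two sites.
    print(f"Category mismatch for product {control_product['id']}")
```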

User experience parity before A/B testing

When preparing the site for A/B testing, before serving real user search or recommendations traffic to Vertex AI Search for commerce with the correct experiment ID mapping, check for user interface and experience parity between the commerce site that uses the legacy search backend (the control) and the site that uses the Vertex AI Search for commerce backend.

Given a search query, compare the search results pages produced by the control search backend and by the Vertex AI Search for commerce backend. Some things to test for include:

Are the same number of facets showing up? If not, review the facet specs and attribute settings in Vertex AI Search for commerce. This matters because facets help users filter and navigate from the initial search results to the product they want. Better, more meaningful facets mean users take less time to find that product; poor facets mean more clicks and scrolling, which can hamper the search experience, lower conversion and click-through rates, and even lead to search abandonment. Keeping the facets similar between the control and test sites ensures that neither group has an unfair advantage when searching for products.

Sponsored product placement in search results is a common feature on many ecommerce sites, and sponsored products are usually not part of the organic search results. Take care that the placement and the products shown on the search results page are nearly the same, if not identical, between the control and test sites. Otherwise, noise is added to the revenue performance metrics, and depending on how different the sponsored products are between the control and test sites, that noise can be substantial.

Other miscellaneous user interface aspects to consider:

  • Are the price and discount information the same between control and test sites?
  • Is the autocomplete suggesting the same completions for the search query?
  • Are the facet values in the same order?
  • Are the products listed using the same style, such as in a list or a grid?

Final tips and considerations

To design a meaningful A/B experiment, keep these tips in mind:

  • Before setting up your A/B experiment, use prediction or search preview to ensure that your model is behaving as you expect.

  • Make sure that the behavior of your site is identical for the experimental group and the control group.

    Site behavior includes latency, display format, text format, page layout, image quality, and image size. There should be no discernible differences for any of these attributes between the experience of the control and experiment groups.

  • Accept and display results as they are returned from Vertex AI Search for commerce, and display them in the same order as they are returned.

    Filtering out items that are out of stock is acceptable. However, you should avoid filtering or ordering results based on your business rules.

  • If you are using search user events, include the required attribution token with them and make sure it is set up correctly. See the documentation for Attribution tokens. A minimal sketch of propagating the attribution token appears after this list.

  • Make sure that the serving config you provide when you request recommendations or search results matches your intention for that recommendation or search result, and the location where you display the results.

    When you use recommendations, the serving config affects how models are trained and therefore what products are recommended. Learn more.

  • If you are comparing an existing solution with Vertex AI Search for commerce, keep the experience of the control group strictly segregated from the experience of the experimental group.

    If the control solution does not provide a recommendation or search result, don't provide one from Vertex AI Search for commerce in the control pages. Doing so skews your test results.

    Make sure your users don't switch between the control group and the experiment group. This is especially important within the same session, but also recommended across sessions. This improves experiment performance and helps you get statistically significant A/B test results sooner.
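As an illustration of the attribution token flow mentioned in the tips above, the following sketch (assuming the google-cloud-retail Python client and placeholder project, serving config, and visitor values) copies the token from a search response into the user event recorded for that results page.

```python
# A sketch, not a definitive implementation: run a search, then record the
# matching user event carrying the attribution token from the response.
from google.cloud import retail_v2

search_client = retail_v2.SearchServiceClient()
event_client = retail_v2.UserEventServiceClient()

# Placeholder catalog and serving config paths; replace with your own values.
catalog = "projects/PROJECT_ID/locations/global/catalogs/default_catalog"

response = search_client.search(
    retail_v2.SearchRequest(
        placement=f"{catalog}/servingConfigs/default_search",
        query="t-shirt",
        visitor_id="visitor-123",
    )
)

# Carry the attribution token into the user event so that later clicks and
# purchases can be attributed back to this specific search response.
event = retail_v2.UserEvent(
    event_type="search",
    visitor_id="visitor-123",
    search_query="t-shirt",
    attribution_token=response.attribution_token,
    experiment_ids=["Control", "Control_mobile", "Control_search"],
)
event_client.write_user_event(parent=catalog, user_event=event)
```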