Quota policy

AI Platform Prediction limits resource allocation and use, and enforces appropriate quotas on a per-project basis. Specific policies vary depending on resource availability, user profile, service usage history, and other factors, and are subject to change without notice.

The sections below outline the current quota limits of the system.

Limits on service requests

You can make only a limited number of individual API requests in each 60-second interval. Each limit applies to a particular API or group of APIs, as described in the following sections.

You can see your project's request quotas in the API Manager for AI Platform Prediction in the Google Cloud console. You can apply for a quota increase by clicking the edit icon next to the quota limit and then clicking Apply for higher quota.

Job requests

The following limits apply to projects.jobs.create requests (training and batch prediction jobs combined):

Period       Limit
-----------  -----
60 seconds   60

Online prediction requests

The following limits apply to projects.predict requests:

Period       Limit
-----------  -------
60 seconds   600,000
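
Requests that exceed a per-minute quota typically fail with HTTP 429 (RESOURCE_EXHAUSTED) and can be retried after a pause. The following is a minimal Python sketch of that pattern, assuming the google-api-python-client library; the project and model names are placeholders.

```python
import random
import time

from googleapiclient import discovery, errors

service = discovery.build("ml", "v1")
# Placeholder resource name; substitute your own project and model.
NAME = "projects/my-project/models/my-model"

def predict_with_backoff(instances, max_retries=5):
    """Call projects.predict, backing off when the per-minute quota is hit."""
    for attempt in range(max_retries):
        try:
            return service.projects().predict(
                name=NAME, body={"instances": instances}
            ).execute()
        except errors.HttpError as err:
            if err.resp.status != 429:  # not a quota error; surface it
                raise
            time.sleep(2 ** attempt + random.random())  # exponential backoff with jitter
    raise RuntimeError("projects.predict kept exceeding the request quota")
```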

Resource management requests

The following limits apply to the combined total of all supported resource management requests (for example, creating, getting, listing, and deleting models and versions):

Period       Limit
-----------  -----
60 seconds   300

In addition, all delete requests and all version create requests are limited to a combined total of 10 concurrent requests.
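
One way to respect that concurrency cap on the client side is to funnel all delete and version-create calls through a shared semaphore. The sketch below assumes the google-api-python-client library and uses a placeholder resource name.

```python
import threading

from googleapiclient import discovery

service = discovery.build("ml", "v1")

# At most 10 delete and version-create requests may be in flight at once.
mutation_slots = threading.Semaphore(10)

def delete_version(version_name):
    """Delete a model version while honoring the shared concurrency cap.

    version_name is a full resource name, for example
    "projects/my-project/models/my-model/versions/v1" (placeholder).
    """
    with mutation_slots:
        return service.projects().models().versions().delete(
            name=version_name
        ).execute()
```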

Resource quotas

In addition to the limits on requests over time, resource usage is subject to the following limits:

  • Maximum number of models: 100.
  • Maximum number of versions: 200. The version limit is for the total number of versions in your project, which can be distributed among your active models however you want.
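
To see how close a project is to these quotas, you can count its models and versions with the projects.models.list and projects.models.versions.list methods. A minimal sketch (ignoring result pagination, and with a placeholder project ID):

```python
from googleapiclient import discovery

service = discovery.build("ml", "v1")
parent = "projects/my-project"  # placeholder project ID

models = service.projects().models().list(parent=parent).execute().get("models", [])
version_count = 0
for model in models:
    versions = service.projects().models().versions().list(
        parent=model["name"]
    ).execute().get("versions", [])
    version_count += len(versions)

print(f"models: {len(models)} / 100, versions: {version_count} / 200")
```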

Model size limits

When you create a model version, the total file size of your model directory must be 500 MB or less if you use a legacy (MLS1) machine type or 10 GB or less if you use a Compute Engine (N1) machine type. Learn more about machine types for online prediction.

You cannot request an increase for these model size limits.
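
Because these limits cannot be raised, it is worth checking the exported model's size before creating a version. A small sketch, assuming the model directory has been copied locally and treating MB and GB as decimal units:

```python
import pathlib

MLS1_LIMIT = 500 * 10 ** 6   # 500 MB for legacy (MLS1) machine types
N1_LIMIT = 10 * 10 ** 9      # 10 GB for Compute Engine (N1) machine types

def model_dir_size(path):
    """Total size in bytes of every file under the model directory."""
    return sum(p.stat().st_size for p in pathlib.Path(path).rglob("*") if p.is_file())

size = model_dir_size("./export/my-model")  # placeholder local path
if size <= MLS1_LIMIT:
    print("Fits any machine type")
elif size <= N1_LIMIT:
    print("Requires a Compute Engine (N1) machine type")
else:
    print("Exceeds the 10 GB limit; the model cannot be deployed")
```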

Limits on concurrent usage of virtual machines

Your project's usage of Google Cloud processing resources is measured by the number of virtual machines that it uses. This section describes the limits to concurrent usage of these resources across your project.

Limits on concurrent nodes for batch prediction

A typical project, when first using AI Platform Prediction, is limited in the number of concurrent nodes used for batch prediction:

  • Concurrent number of prediction nodes: 72.

Node usage for online prediction

AI Platform Prediction does not apply quotas to node usage for online prediction. See more about prediction nodes and resource allocation.

Limits on concurrent vCPU usage for online prediction

A typical project, when first using AI Platform Prediction, is limited to the following number of concurrent vCPUs on each regional endpoint when you use Compute Engine (N1) machine types. Different regional endpoints might have different quotas, and the quotas for your project might change over time.

Total concurrent number of vCPUs on each regional endpoint:

  • us-central1: 450
  • us-east1: 450
  • us-east4: 20
  • us-west1: 450
  • northamerica-northeast1: 20
  • europe-west1: 450
  • europe-west2: 20
  • europe-west3: 20
  • europe-west4: 450
  • asia-east1: 450
  • asia-northeast1: 20
  • asia-southeast1: 450
  • australia-southeast1: 20

These are the default quotas, and you can request increased quotas.
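
Because each prediction node contributes its machine type's vCPU count, a deployment's worst-case usage is its maximum node count times the vCPUs per node, and that product must fit within the regional quota. A quick sanity check, using an excerpt of the defaults above and standard N1 vCPU counts:

```python
# Excerpt of the default per-region vCPU quotas listed above.
VCPU_QUOTA = {"us-central1": 450, "us-east4": 20, "europe-west1": 450}

# vCPU counts for some Compute Engine (N1) machine types.
VCPUS_PER_NODE = {"n1-standard-2": 2, "n1-standard-4": 4, "n1-highmem-8": 8}

def fits_vcpu_quota(region, machine_type, max_nodes):
    """True if scaling to max_nodes stays within the region's default quota."""
    return VCPUS_PER_NODE[machine_type] * max_nodes <= VCPU_QUOTA[region]

print(fits_vcpu_quota("us-central1", "n1-standard-4", 100))  # 400 <= 450 -> True
print(fits_vcpu_quota("us-east4", "n1-highmem-8", 4))        # 32 > 20   -> False
```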

Limits on concurrent GPU usage for online prediction

A typical project, when first using AI Platform Prediction, is limited to the following number of concurrent GPUs on each regional endpoint. Different regional endpoints might have different quotas, and the quotas for your project might change over time.

Total concurrent number of GPUs: This is the maximum number of GPUs in concurrent use, split by type and regional endpoint as follows:

  • Concurrent number of Tesla K80 GPUs:
    • us-central1: 30
    • us-east1: 30
    • europe-west1: 30
    • asia-east1: 30
  • Concurrent number of Tesla P4 GPUs:
    • us-central1: 2
    • us-east4: 2
    • northamerica-northeast1: 2
    • europe-west4: 2
    • asia-southeast1: 2
    • australia-southeast1: 2
  • Concurrent number of Tesla P100 GPUs:
    • us-central1: 30
    • us-east1: 30
    • us-west1: 30
    • europe-west1: 30
    • asia-southeast1: 30
  • Concurrent number of Tesla T4 GPUs:
    • us-central1: 6
    • us-east1: 6
    • us-west1: 6
    • europe-west2: 2
    • europe-west4: 6
    • asia-northeast1: 2
    • asia-southeast1: 6
  • Concurrent number of Tesla V100 GPUs:
    • us-central1: 2
    • us-west1: 2
    • europe-west4: 2

These are the default quotas, and you can request increased quotas.
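
The GPUs counted against these quotas are the ones you attach to a version's nodes through its acceleratorConfig, so the version's maximum node count times its per-node GPU count must fit the regional quota. A sketch of creating such a version with the google-api-python-client library; the bucket, project, model, version name, and runtime version are placeholders:

```python
from googleapiclient import discovery

service = discovery.build("ml", "v1")

# Each node of this version uses one Tesla T4, so its maximum node count
# must stay within the regional T4 quota listed above.
body = {
    "name": "v1",                              # placeholder version name
    "deploymentUri": "gs://my-bucket/model/",  # placeholder model directory
    "runtimeVersion": "2.11",                  # placeholder runtime version
    "machineType": "n1-standard-4",
    "acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_T4"},
}
service.projects().models().versions().create(
    parent="projects/my-project/models/my-model", body=body
).execute()
```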

The GPUs that you use for prediction are not counted as GPUs for Compute Engine, and the quota for AI Platform Prediction does not give you access to any Compute Engine VMs using GPUs. If you want to spin up a Compute Engine VM using a GPU, you must request Compute Engine GPU quota, as described in the Compute Engine documentation.

For more information, see how to use GPUs for online prediction.

Requesting a quota increase

The quotas listed on this page are allocated per project, and may increase over time with use. If you need more processing capability, you can apply for a quota increase in one of the following ways:

  • Use the Google Cloud console to request increases for quotas that are listed in the API Manager for AI Platform Prediction:

    1. Find the section of the quota that you want to increase.

    2. Click the pencil icon next to the quota value at the bottom of the usage chart for that quota.

    3. Enter your requested increase:

      • If your desired quota value is within the range displayed on the quota limit dialog, enter your new value and click Save.

      • If you want to increase the quota beyond the maximum displayed, click Apply for higher quota and follow the instructions for the second way to request an increase.

  • If you want to increase a quota that isn't listed in the Google Cloud console, such as GPU quotas, use the AI Platform Quota Request form to request a quota increase. These requests are handled on a best-effort basis, which means there are no service-level agreements (SLAs) or service-level objectives (SLOs) involved in the review of these requests.
