Troubleshoot Cloud Run issues

This page shows you how to resolve issues with Cloud Run.

For issues not listed on this page, check whether they are known issues.

Deployment errors

This section lists issues that you might encounter with deployment and provides suggestions for how to fix each of them.

Container failed to start

The following error occurs when you try to deploy:

Container failed to start. Failed to start and then listen on the port defined by the PORT environment variable.

To resolve this issue, rule out the following potential causes:

Verify that you can run your container image locally. If your container image cannot run locally, you need to diagnose and fix the issue locally first.
Check if your container is listening for requests on the expected port as documented in the container runtime contract. Your container must listen for incoming requests on the port that is defined by Cloud Run and provided in the PORT environment variable. See Configuring containers for instructions on how to specify the port.
Check if your container is listening on all network interfaces, commonly denoted as 0.0.0.0.
Verify that your container image is compiled for 64-bit Linux as required by the container runtime contract.

Note: If you build your container image on an ARM-based machine, it might not work as expected with Cloud Run. To solve this issue, build your image using Cloud Build.
Use Cloud Logging to look for application errors in stdout or stderr logs. You can also look for crashes captured in Error Reporting.

You might need to update your code or your revision settings to fix errors or crashes. You can also troubleshoot your service locally.
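To illustrate the port and interface requirements above, here is a minimal sketch of a Python HTTP server that reads the PORT environment variable and binds to 0.0.0.0. It uses only the standard library. The handler class is purely illustrative, and the 8080 fallback exists only for local runs.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Illustrative handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"Hello from Cloud Run\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def listen_address() -> tuple[str, int]:
    # Cloud Run provides the port in PORT; 8080 is only a local fallback.
    port = int(os.environ.get("PORT", "8080"))
    # "0.0.0.0" means all interfaces. Binding to "127.0.0.1" instead would
    # make the server unreachable from outside the container.
    return ("0.0.0.0", port)


def serve() -> None:
    # Blocks forever, handling one request at a time.
    HTTPServer(listen_address(), HelloHandler).serve_forever()
```

To start the server, call serve() from your entry point, for example inside an `if __name__ == "__main__":` block. You can check the behavior locally with `PORT=9090 python app.py`.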

Internal error, resource readiness deadline exceeded

The following error occurs when you try to deploy or try to call another Google Cloud API:

The server has encountered an internal error. Please try again later. Resource readiness deadline exceeded.

This issue might occur when the Cloud Run service agent does not exist, or when it does not have the Cloud Run Service Agent (roles/run.serviceAgent) role.

To verify that the Cloud Run service agent exists in your Google Cloud project and has the necessary role, perform the following steps:

Open the Permissions page in the Google Cloud console.
In the upper-right corner of the Permissions page, select the Include Google-provided role grants checkbox.
In the Principals list, locate the Cloud Run service agent, which has the ID service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com.
Verify that the service agent has the Cloud Run Service Agent role. If the service agent does not have the role, grant it.
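If you prefer the command line, `gcloud projects get-iam-policy PROJECT_ID --format=json` prints the project's IAM policy. That output should include role grants held by Google-managed service agents. The following Python sketch runs that command and checks whether the Cloud Run service agent holds roles/run.serviceAgent. It assumes the gcloud CLI is installed and authenticated, and the helper names are hypothetical.

```python
import json
import subprocess


def cloud_run_service_agent(project_number: str) -> str:
    # ID format of the Cloud Run service agent described in the steps above.
    return f"service-{project_number}@serverless-robot-prod.iam.gserviceaccount.com"


def has_role(policy: dict, member_email: str, role: str) -> bool:
    # IAM policies list service accounts as "serviceAccount:EMAIL".
    member = f"serviceAccount:{member_email}"
    return any(
        binding.get("role") == role and member in binding.get("members", [])
        for binding in policy.get("bindings", [])
    )


def fetch_policy(project_id: str) -> dict:
    # Requires an authenticated gcloud CLI with permission to read the policy.
    out = subprocess.run(
        ["gcloud", "projects", "get-iam-policy", project_id, "--format=json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

For example, `has_role(fetch_policy("my-project"), cloud_run_service_agent("123456789"), "roles/run.serviceAgent")` returns False when the grant is missing. Here "my-project" and "123456789" are placeholders.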

Error user 'root' is not found in `/etc/passwd`

The following error occurs when you try to deploy:

ERROR: "User \"root\""not found in /etc/passwd

This issue occurs when customer-managed encryption keys are specified using the --key parameter.

To resolve this issue, specify USER 0 instead of USER root in the Dockerfile.

Default Compute Engine service account is deleted

The following error occurs when you try to deploy:

ERROR: (gcloud.run.deploy) User EMAIL_ADDRESS does not have permission to access namespace NAMESPACE_NAME (or it may not exist): Permission 'iam.serviceaccounts.actAs' denied on service account PROJECT_NUMBER-compute@developer.gserviceaccount.com (or it may not exist).

This issue occurs in one of the following situations:

The default Compute Engine service account does not exist in the project, and no service account is specified with the --service-account gcloud flag at the time of deployment.
The developer or principal deploying the service does not have the permissions for the default Compute Engine service account that are required to deploy.

To resolve this issue:

Specify a service account using the --service-account gcloud flag.
Verify that the service account you specify has the permissions required to deploy.

To verify that the default Compute Engine service account exists in your Google Cloud project, perform the following steps:

Open the Permissions page in the Google Cloud console.
In the upper-right corner of the Permissions page, select the Include Google-provided role grants checkbox.
In the Principals list, locate the default Compute Engine service account, which has the ID PROJECT_NUMBER-compute@developer.gserviceaccount.com.

Cloud Run Service Agent doesn't have permission to read the image

The following error occurs when you try to deploy from PROJECT-ID using an image that is stored in Container Registry in PROJECT-ID-2:

Google Cloud Run Service Agent must have permission to read the image, gcr.io/PROJECT-ID/IMAGE-NAME. Ensure that the provided container image URL is correct and that above account has permission to access the image. If you just enabled the Cloud Run API, the permissions might take a few minutes to propagate. Note that PROJECT-ID/IMAGE-NAME is not in project PROJECT-ID-2. Permission must be granted to the Google Cloud Run Service Agent from this project.

To resolve this issue, follow these troubleshooting recommendations:

Follow the instructions for deploying container images from other Google Cloud projects to ensure that your principals have the necessary permissions.
This issue might also occur if the project is in a VPC Service Controls perimeter with a restriction on the Cloud Storage API that prohibits requests from the Cloud Run service agent. To fix this:
1. Open Logs Explorer in the Google Cloud console. Do not use the Logs page inside the Cloud Run page.
2. Enter the following text in the query field:
```
protoPayload.@type="type.googleapis.com/google.cloud.audit.AuditLog"
severity=ERROR
protoPayload.status.details.violations.type="VPC_SERVICE_CONTROLS"
protoPayload.authenticationInfo.principalEmail="service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com"
```
3. If you see any log entries after you use this query, examine the log entries to determine whether you need to update your VPC Service Controls policies. They may indicate that you need to add service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com to a pre-existing access policy.
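You can also run the same query from the command line with `gcloud logging read`. The Python sketch below builds the filter and calls the CLI. It assumes an authenticated gcloud CLI, and the helper names are hypothetical. By default, `gcloud logging read` only returns recent entries, so you might need to add `--freshness` to look further back.

```python
import subprocess


def vpc_sc_violation_filter(project_number: str) -> str:
    agent = f"service-{project_number}@serverless-robot-prod.iam.gserviceaccount.com"
    clauses = [
        'protoPayload.@type="type.googleapis.com/google.cloud.audit.AuditLog"',
        "severity=ERROR",
        'protoPayload.status.details.violations.type="VPC_SERVICE_CONTROLS"',
        f'protoPayload.authenticationInfo.principalEmail="{agent}"',
    ]
    # An explicit AND matches the one-clause-per-line form used in Logs Explorer.
    return " AND ".join(clauses)


def read_violations(project_id: str, project_number: str, limit: int = 20) -> str:
    # Returns the raw JSON printed by gcloud; parse it with json.loads if needed.
    return subprocess.run(
        [
            "gcloud", "logging", "read", vpc_sc_violation_filter(project_number),
            f"--project={project_id}", f"--limit={limit}", "--format=json",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
```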

Container import errors

The following error occurs when you try to deploy:

The service has encountered an error during container import. Please try again later. Resource readiness deadline exceeded.

To resolve this issue, rule out the following potential causes:

Ensure that the container's file system does not contain non-UTF-8 characters.
Some Windows-based Docker images use foreign layers. Although Container Registry doesn't throw an error when foreign layers are present, the Cloud Run control plane does not support them. To resolve this, try setting the --allow-nondistributable-artifacts flag in the Docker daemon.
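To look for non-UTF-8 characters, you can scan a directory tree for file and directory names that do not decode as UTF-8. Run the scan inside the container or against a directory extracted from the image. The following Python sketch assumes the problem lies in names rather than file contents, and non_utf8_paths is a hypothetical helper.

```python
import os


def non_utf8_paths(root: str) -> list[bytes]:
    """Return paths under root whose names are not valid UTF-8."""
    bad = []
    # Walking with a bytes path yields raw bytes names, so nothing is
    # silently decoded or replaced along the way.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            try:
                name.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(os.path.join(dirpath, name))
    return bad
```

For example, `print(non_utf8_paths("/app"))` lists the offending paths under a hypothetical /app directory.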

Serving errors

This section lists issues that you might encounter with serving and provides suggestions for how to fix each of them.

HTTP 401: Client is not authenticated properly

The following error occurs during serving:

The request was not authorized to invoke this service

To resolve this issue:

If invoked by a service account, the audience claim (aud) of the Google-signed ID token must be set to one of the following:
- The Cloud Run URL of the receiving service, using the form https://service-xyz.run.app.
  - The Cloud Run service must require authentication.
  - The Cloud Run service can be invoked by the Cloud Run URL or through a load balancer URL.
- The Client ID of an OAuth 2.0 Client ID with type Web application, using the form nnn-xyz.apps.googleusercontent.com.
  - The Cloud Run service can be invoked through an HTTPS load balancer secured by IAP.
  - This is useful for a Google Cloud load balancer (GCLB) backed by multiple Cloud Run services in different regions.
- A configured custom audience, using the exact value provided. For example, if the custom audience is service.example.com, the audience claim (aud) value must also be service.example.com. If the custom audience is https://service.example.com, the audience claim value must also be https://service.example.com.
The jwt.io tool is useful for checking the claims in a JWT.
If the auth token has an invalid format, a 401 error occurs. If the token has a valid format but the IAM principal used to generate it is missing the run.routes.invoke permission, a 403 error occurs.
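If you would rather not paste a live token into a website, you can decode the claims locally. The following Python sketch decodes a JWT payload so that you can compare its aud claim with the values listed above. It does not verify the signature, so use it only for inspection and never for authorization decisions. jwt_claims is a hypothetical helper.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode a JWT's payload WITHOUT verifying its signature (inspection only)."""
    # A JWT is header.payload.signature; the claims live in the middle part.
    payload = token.split(".")[1]
    # JWTs use unpadded base64url, so restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```

For example, `jwt_claims(token)["aud"]` shows the audience that the token was minted for.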

HTTP 403: Client is not authorized to invoke or call the service

The following error might appear in Cloud Logging with resource.type = "cloud_run_revision":

The request was not authenticated. Either allow unauthenticated invocations or set the proper Authorization header

The following error is present in the HTTP response returned to the client:

403 Forbidden
Your client does not have permission to get URL from this server.

To resolve this issue when the resource.type = "cloud_run_revision" Cloud Logging error is present:

If the service is meant to be invocable by anyone, update its IAM settings to make the service public.
If the service is meant to be invocable only by certain identities, make sure that you invoke it with the proper authorization token.
- If invoked by a developer or an end user: Ensure that the developer or user has the run.routes.invoke permission, which you can grant through the Cloud Run Admin (roles/run.admin) or Cloud Run Invoker (roles/run.invoker) role.
- If invoked by a service account: Ensure that the service account has been granted the Cloud Run Invoker (roles/run.invoker) role on the Cloud Run service.
- Calls that are missing an auth token, or that carry a validly formatted token generated by an IAM principal lacking the run.routes.invoke permission, result in this 403 error.
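To call a private service from code running as a service account, you can fetch a Google-signed ID token with the google-auth library and send it as a bearer token. The following Python sketch assumes that google-auth and requests are installed. It also assumes credentials are available either from the metadata server (on Google Cloud) or from a service account key file referenced by GOOGLE_APPLICATION_CREDENTIALS. authorized_request and invoke are hypothetical helper names.

```python
import urllib.request
from typing import Optional


def authorized_request(url: str, id_token: str) -> urllib.request.Request:
    # Cloud Run expects the Google-signed ID token as a bearer token.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {id_token}"})


def invoke(url: str, audience: Optional[str] = None) -> bytes:
    # Imported lazily so authorized_request works without google-auth installed.
    import google.auth.transport.requests
    import google.oauth2.id_token

    # The audience defaults to the service URL; a custom audience or OAuth
    # client ID can be passed instead, as described in the 401 section.
    token = google.oauth2.id_token.fetch_id_token(
        google.auth.transport.requests.Request(), audience or url
    )
    with urllib.request.urlopen(authorized_request(url, token)) as resp:
        return resp.read()
```

The audience passed to fetch_id_token must match one of the accepted aud values for the service. Otherwise, the request fails with 401 even when the caller has the Invoker role.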

To resolve this issue when the resource.type = "cloud_run_revision" Cloud Logging error is not present:

A 403 status code can be returned when a service's ingress is set to All but the request is blocked by VPC Service Controls. See the next section on HTTP 404 errors for information on troubleshooting VPC Service Controls denials.

HTTP 404: Not Found

The following issue occurs during serving:

You encounter an HTTP 404 error.

To resolve this issue:

Verify that the URL you are requesting is correct by checking the service details page in the Google Cloud console or by running the following command:
```
gcloud run services describe SERVICE_NAME | grep URL
```
Inspect where your app logic might be returning 404 error codes. If your app is returning the 404, it will be visible in Cloud Logging.
Make sure your app does not start listening on its configured port before it is ready to receive requests.
Verify that the app does not return a 404 error code when you run it locally.

A 404 is returned when a Cloud Run service's ingress settings are set to "Internal" or "Internal and Cloud Load Balancing" and a request does not satisfy the specified network restriction. In this scenario, the request does not reach the container and the 404 is not present in Cloud Logging with the following filter:

```
resource.type="cloud_run_revision"
log_name="projects/PROJECT_ID/logs/run.googleapis.com%2Frequests"
httpRequest.status=404
```

With the same ingress settings, the request might also be blocked by VPC Service Controls based on the caller's context, including project and IP address. To check for a VPC Service Controls policy violation:

Open Logs Explorer in the Google Cloud console (not the Logs page for Cloud Run).

Enter the following text in the query field:

```
resource.type="audited_resource"
log_name="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Fpolicy"
resource.labels.method="run.googleapis.com/HttpIngress"
```

If you see any log entries after you use this query, examine the log entries to determine whether or not you need to update your VPC Service Controls policies.

HTTP 429: No available container instances

The following error occurs during serving:

HTTP 429
The request was aborted because there was no available instance.
The Cloud Run service might have reached its maximum container instance
limit or the service was otherwise not able to scale to incoming requests.
This might be caused by a sudden increase in traffic, a long container startup time or a long request processing time.

To resolve this issue, check the "Container instance count" metric for your service and consider increasing the maximum instances limit if your usage is nearing it. See the maximum instances settings, and if you need more instances, request a quota increase.

HTTP 500: Cloud Run couldn't manage the rate of traffic

The following error occurs during serving and can also occur when the service has not reached its maximum container instance limit:

HTTP 500
The request was aborted because there was no available instance

This error can be caused by one of the following:

A large, sudden increase in traffic.
A long cold start time.
A long request processing time, or a sudden increase in request processing time.
The service reaching its maximum container instance limit (HTTP 429).
Transient factors attributed to the Cloud Run service.

To resolve this issue, address the causes listed above.

In addition to fixing these issues, as a workaround you can implement exponential backoff and retries for requests that the client must not drop.
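Here is a minimal sketch of exponential backoff with full jitter in Python. RetryableError is a hypothetical exception that your own client code would raise for responses it treats as retryable, for example HTTP 500, 503, or 429. The attempt count and delays are illustrative. The sleep parameter exists only so that the function can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. HTTP 429/500/503."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to base * 2^attempt, capped.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

Many HTTP clients and Google Cloud client libraries already provide configurable retry policies. Prefer those where available, and retry only requests that are safe to repeat.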

Note that a short, sudden increase in traffic or request processing time might only be visible in Cloud Monitoring if you zoom in to 10-second resolution.

When the root cause of the issue is a period of heightened transient errors attributable solely to Cloud Run, you can contact Support.

HTTP 500 / HTTP 503: Container instances are exceeding memory limits

The following error occurs during serving:

In Cloud Logging:

While handling this request, the container instance was found to be using too much memory and was terminated. This is likely to cause a new container instance to be used for the next request to this revision. If you see this message frequently, you may have a memory leak in your code or may need more memory. Consider creating a new revision with more memory.

To resolve this issue:

Determine if your container instances are exceeding the available memory. Look for related errors in the varlog/system logs.
If the instances are exceeding the available memory, consider increasing the memory limit.

Note that in Cloud Run, files written to the local filesystem count towards the available memory. This also includes any log files that are written to locations other than /var/log/* and /dev/log.

HTTP 503: Malformed response or container instance connection issue

The following error occurs during serving:

HTTP 503
The request failed because either the HTTP response was malformed or connection to the instance had an error.

To resolve this issue, follow these troubleshooting recommendations:

Check Cloud Logging: Use Cloud Logging to look for out-of-memory errors in the logs. If you see error messages about container instances exceeding memory limits, follow the recommendations to resolve that issue.
App-level timeouts: If requests terminate with error code 503 before reaching the request timeout set in Cloud Run, you might need to update the request timeout setting for your language framework:
- Node.js developers might need to update the server.timeout property through server.setTimeout (use server.setTimeout(0) for an unlimited timeout), depending on the Node.js version.
- Python developers who see [CRITICAL] WORKER TIMEOUT errors need to update Gunicorn's default timeout.
Downstream network bottleneck: In some cases, a 503 error code can result indirectly from a downstream network bottleneck, such as during load testing. For example, if your service routes traffic through a Serverless VPC Access connector, ensure that the connector has not exceeded its throughput threshold by following these steps:
1. Open the Serverless VPC Access page in the Google Cloud console.
2. Check for any red bars in the throughput chart histogram. If there is a red bar, consider increasing the maximum instances or the instance type your connector uses. Alternatively, compress traffic sent through the Serverless VPC Access connector.
Inbound request limit to a single container: There is a known issue where high request rates per container result in this 503 error. If a container instance receives more than 800 requests per second, the available TCP sockets can be exhausted. To remedy this, try any of the following:
1. Turn on HTTP/2 for your service and make any changes needed for your service to support HTTP/2.
2. Avoid sending more than 800 requests/second to a single container instance by lowering the configured concurrency. Use the following equation to estimate the request rate per container instance: requests/sec/container_instance ~= concurrency * (1sec / median_request_latency)
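For Python apps served by Gunicorn, the worker timeout is set in Gunicorn's Python configuration file, which Gunicorn typically loads as gunicorn.conf.py from the working directory. The following sketch disables Gunicorn's worker timeout so that the Cloud Run request timeout becomes the effective limit, and it binds to the port in PORT. The worker and thread counts are illustrative assumptions that you would tune for your workload.

```python
# gunicorn.conf.py: Gunicorn reads these module-level names as settings.
import os

# Listen on all interfaces on the port that Cloud Run provides.
bind = f"0.0.0.0:{os.environ.get('PORT', '8080')}"

# 0 disables Gunicorn's worker timeout (default 30 seconds), leaving the
# Cloud Run request timeout as the effective limit.
timeout = 0

# Illustrative values; tune them for your workload.
workers = 1
threads = 8
```

With this file in place, a command such as `gunicorn main:app` picks it up, where main:app is a placeholder for your WSGI module and application object. The same timeout can also be set on the command line with `--timeout 0`.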

HTTP 503: Unable to process some requests due to high concurrency setting

The following error occurs during serving:

HTTP 503
The Cloud Run service probably has reached its maximum container instance limit. Consider increasing this limit.

This issue occurs when your container instances are using a lot of CPU to process requests, and as a result, the container instances cannot process all of the requests, so some requests return a 503 error code.

To resolve this issue, try one or more of the following:

Increase the maximum number of container instances for your service.
Lower the service's concurrency. Refer to setting concurrency for more detailed instructions.
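When you lower concurrency, it can help to estimate how far to go. The estimate given earlier for the per-instance request rate, requests/sec/container_instance ~= concurrency * (1sec / median_request_latency), can be turned into two small Python helpers. The 800 requests/second figure comes from the known issue described in the previous section. The helper names are hypothetical, and the result is only a rough starting point, not a guarantee.

```python
import math


def estimated_rps_per_instance(concurrency: int, median_latency_ms: float) -> float:
    # requests/sec/container_instance ~= concurrency * (1000 ms / median_latency_ms)
    return concurrency * 1000.0 / median_latency_ms


def max_concurrency_for(rps_limit: float, median_latency_ms: float) -> int:
    # Largest whole concurrency whose estimated rate stays at or below rps_limit,
    # but never below 1 because each instance must accept at least one request.
    return max(1, math.floor(rps_limit * median_latency_ms / 1000.0))
```

For example, with a median latency of 50 ms, a concurrency of 80 gives an estimated 1,600 requests/second per instance. A concurrency of 40 brings the estimate down to 800.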

HTTP 504: Gateway timeout error

HTTP 504
The request has been terminated because it has reached the maximum request timeout.

If your service is processing long requests, you can increase the request timeout. If your service doesn't return a response within the time specified, the request ends and the service returns an HTTP 504 error, as documented in the container runtime contract.

To troubleshoot this issue, try one or more of the following:

Instrument logging and tracing to understand where your app is spending time before exceeding your configured request timeout.
Outbound connections are reset occasionally, due to infrastructure updates. If your application reuses long-lived connections, then we recommend that you configure your application to re-establish connections to avoid the reuse of a dead connection.
- Depending on your app's logic or error handling, a 504 error might be a signal that your application is trying to reuse a dead connection and the request blocks until your configured request timeout.
- You can use a liveness probe to help terminate an instance that returns persistent errors.
Out of memory errors that happen inside the app's code, for example, java.lang.OutOfMemoryError, do not necessarily terminate a container instance. If memory usage does not exceed the container memory limit, then the instance will not be terminated. Depending on how the app handles app-level out of memory errors, requests might hang until they exceed your configured request timeout.
- If you want the container instance to terminate in the above scenario, then configure your app-level memory limit to be greater than your container memory limit.
- You can use a liveness probe to help terminate an instance that returns persistent errors.

Connection reset by peer

One of the following errors occurs during serving:

Connection reset by peer

asyncpg.exceptions.ConnectionDoesNotExistError: connection was closed in the middle of operation

grpc.StatusRuntimeException: UNAVAILABLE: io exception

psycopg.OperationalError: the connection is closed

ECONNRESET

This error occurs when an application has an established TCP connection with a peer across the network and that peer unexpectedly closes the connection.

To resolve this issue:

If you are trying to perform background work with CPU throttling, try using the "CPU is always allocated" CPU allocation setting.
Ensure that you are within the outbound request timeouts. If your application keeps any connection idle beyond these thresholds, the gateway reaps the connection.
By default, the TCP socket option keepalive is disabled for Cloud Run. There is no direct way to configure the keepalive option in Cloud Run at the service level, but you can enable the keepalive option for each socket connection by providing the correct socket options when opening a new TCP socket connection, depending on the client library that you are using for this connection in your application.
Occasionally outbound connections will be reset due to infrastructure updates. If your application reuses long-lived connections, then we recommend that you configure your application to re-establish connections to avoid the reuse of a dead connection.
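Most client libraries and database drivers expose their own keepalive settings, so check those first. At the raw socket level, the following Python sketch shows how to enable keepalive on a single TCP socket. The timing option names (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT) are Linux names, so the sketch only sets the ones the platform exposes. The timing values are illustrative assumptions; choose an idle time shorter than the outbound timeouts mentioned above.

```python
import socket


def keepalive_socket(idle_s: int = 60, interval_s: int = 10, probes: int = 5) -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Keepalive is off by default; turn it on for this socket only.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Timing options are platform specific, so set only those that exist here.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection drops
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

Call connect() on the returned socket as usual. The options apply for the lifetime of that connection.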

Identity token signature redacted by Google

The following error occurs during serving:

SIGNATURE_REMOVED_BY_GOOGLE

This can occur during development and testing in the following circumstance:

A user logs in using Google Cloud CLI or Cloud Shell.
The user generates an ID token using gcloud commands.
The user tries to use the ID token to invoke a non-public Cloud Run service.

This is by design. Google removes the token signature due to security concerns to prevent any non-public Cloud Run service from replaying ID tokens that are generated in this manner.

To resolve this issue, invoke your private service with a new ID token. Refer to testing authentication in your service for more information.

OpenBLAS warning in logs

If you use OpenBLAS-based libraries such as NumPy with the first generation execution environment, you might see the following warning in your logs:

OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k

This is just a warning, and it doesn't impact your service. The warning appears because the container sandbox used by the first generation execution environment does not expose low-level hardware features. You can optionally switch to the second generation execution environment if you don't want these warnings in your logs.

Spark fails when obtaining IP address of machine to bind to

One of the following errors occurs during serving:

assertion failed: Expected hostname (not IP) but got <IPv6 ADDRESS>

assertion failed: Expected hostname or IPv6 IP enclosed in [] but got <IPv6 ADDRESS>

To resolve this issue:

Set the environment variable in your Dockerfile with ENV SPARK_LOCAL_IP="127.0.0.1". In Cloud Run, if SPARK_LOCAL_IP is not set, Spark defaults to an IPv6 address instead of localhost. Note that RUN export SPARK_LOCAL_IP="127.0.0.1" does not work: the variable is not available at runtime, and Spark behaves as if it were not set.

Mapping custom domains

Custom domain is stuck in certificate provisioning state

One of the following errors occurs when you try to map a custom domain:

The domain is available over HTTP.  Waiting for certificate provisioning. You must configure your DNS records for certificate issuance to begin and to accept HTTP traffic.

Waiting for certificate provisioning. You must configure your DNS records for certificate issuance to begin.

To resolve this issue:

Wait at least 24 hours. Provisioning the SSL certificate usually takes about 15 minutes, but it can take up to 24 hours.
Verify that you've properly updated your DNS records at your domain registrar by using the Google Admin Toolbox dig tool.

The DNS records in your domain registrar need to match what the Google Cloud console prompts you to add.
Confirm that the root of the domain is verified under your account using one of the following methods:
- Follow the instructions for adding verified domain owners and check that your account is listed as a Verified Owner.
- Use the Search Console.

Verify that the certificate for the domain is not expired. To find the expiry bounds, use the following command:

```
echo | openssl s_client -servername 'ROOT_DOMAIN' -connect 'ROOT_DOMAIN:443' 2>/dev/null | openssl x509 -startdate -enddate -noout
```
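If openssl is not available, Python's standard ssl module can report the certificate's expiry date. This is a sketch with hypothetical helper names. With the default context, certificate verification is on, so an expired or otherwise invalid certificate raises ssl.SSLCertVerificationError instead of returning a date. That error is itself a useful signal.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    # not_after uses the format returned by SSLSocket.getpeercert(),
    # for example "Jan  5 09:34:43 2030 GMT".
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - (time.time() if now is None else now)) / 86400


def fetch_not_after(domain: str, port: int = 443) -> str:
    # Verification is on by default, so getpeercert() returns the parsed fields.
    ctx = ssl.create_default_context()
    with socket.create_connection((domain, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=domain) as tls:
            return tls.getpeercert()["notAfter"]
```

For example, `days_until_expiry(fetch_not_after("www.example.com"))` returns the remaining validity in days, where www.example.com stands in for your mapped domain.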

Admin API

The feature is not supported in the declared launch stage

The following error occurs when you call the Cloud Run Admin API:

The feature is not supported in the declared launch stage

This error occurs when you call the Cloud Run Admin API directly and use a beta feature without specifying a launch stage annotation or field.

To resolve this issue, add the launch stage field to the requests. Below are examples for the v1 REST API and the v2 REST API:

The following example adds a launch stage annotation to a client request using JSON and the v1 REST API:

```
"annotations": {
  "run.googleapis.com/launch-stage": "BETA"
}
```

The following example adds a LaunchStage reference to a client request using JSON and the v2 REST API:

```
"launchStage": "BETA"
```

The following example adds a launch stage annotation to a service request using YAML and the v1 REST API:

```
kind: Service
metadata:
  annotations:
    run.googleapis.com/launch-stage: BETA
```
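If you build request bodies in code, the two placements above can be wrapped in one helper. The following Python sketch is a hypothetical function, not part of any client library. It adds the launch stage to a copy of a Service request body, as a metadata annotation for v1 or as the top-level launchStage field for v2.

```python
import copy


def with_launch_stage(body: dict, api_version: str, stage: str = "BETA") -> dict:
    """Return a copy of a Service request body with the launch stage declared."""
    result = copy.deepcopy(body)  # leave the caller's dict untouched
    if api_version == "v1":
        # v1 resources carry the launch stage as a metadata annotation.
        annotations = result.setdefault("metadata", {}).setdefault("annotations", {})
        annotations["run.googleapis.com/launch-stage"] = stage
    elif api_version == "v2":
        # v2 resources have a top-level launchStage field.
        result["launchStage"] = stage
    else:
        raise ValueError(f"unsupported API version: {api_version}")
    return result
```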

Client disconnect does not propagate to Cloud Run

When you use HTTP/1.1 on Cloud Run, client disconnect events are not propagated to the Cloud Run container.

The solution is to use WebSockets or HTTP/2, which do propagate client disconnects.

Troubleshooting network file system issues

Learn more about Using network file systems.

Cannot access files using NFS

| Error | Suggested remedy |
|---|---|
| `mount.nfs: Protocol not supported` | Some base images, for example `debian` and `adoptopenjdk/openjdk11`, are missing the dependency `nfs-kernel-server`. |
| `mount.nfs: Connection timed out` | Make sure you are providing the correct IP address of the Filestore instance. |
| `mount.nfs: access denied by server while mounting IP_ADDRESS:/FILESHARE` | Check that the file share name is correct. |

Cannot access files using Cloud Storage FUSE

See Cloud Storage FUSE troubleshooting guide.
