Tutorial: Local troubleshooting of a Cloud Run service

This tutorial shows how a service developer can troubleshoot a broken Cloud Run service using Stackdriver tools for discovery and a local development workflow for investigation.

This step-by-step "case study" companion to the troubleshooting guide uses a sample project that produces runtime errors when deployed; you troubleshoot the deployment to find and fix the problem.

You can use this tutorial with Cloud Run (fully managed) or Cloud Run for Anthos on Google Cloud. You cannot use this tutorial with Cloud Run for Anthos on-prem due to Google Cloud's operations suite support limitations.

Objectives

  • Write, build, and deploy a service to Cloud Run
  • Use Error Reporting and Cloud Logging to identify an error
  • Retrieve the container image from Container Registry for a root cause analysis
  • Fix the "production" service, then improve the service to mitigate future problems

Costs

This tutorial uses billable components of Cloud Platform, including Cloud Run, Cloud Build, and Container Registry.

Use the Pricing Calculator to generate a cost estimate based on your projected usage.

New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the Cloud Run API
  5. Install and initialize the Cloud SDK.
  6. For Cloud Run for Anthos on Google Cloud, install the kubectl component:
    gcloud components install kubectl
  7. Update components:
    gcloud components update
  8. If you are using Cloud Run for Anthos on Google Cloud, create a new cluster using the instructions in Setting up Cloud Run for Anthos on Google Cloud.
  9. If you are using Cloud Run for Anthos on Google Cloud, install curl to try out the service.
  10. Follow the instructions to install Docker locally. A quick version check for these tools is sketched after this list.
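Before continuing, you can sanity-check the installed tools by printing their versions. A minimal sketch (output varies by installed version; kubectl is only needed for Cloud Run for Anthos):

gcloud --version
docker --version
kubectl version --client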

Setting up gcloud defaults

To configure gcloud with defaults for your Cloud Run service:

  1. Set your default project:

    gcloud config set project PROJECT_ID

    Replace PROJECT_ID with the ID of the project you created for this tutorial.

  2. If you are using Cloud Run (fully managed), configure gcloud for your chosen region:

    gcloud config set run/region REGION

    Replace REGION with the supported Cloud Run region of your choice.

  3. If you are using Cloud Run for Anthos on Google Cloud, configure gcloud for your cluster:

    gcloud config set run/cluster CLUSTER-NAME
    gcloud config set run/cluster_location REGION

    Replace

    • CLUSTER-NAME with the name you used for your cluster,
    • REGION with the supported cluster location of your choice.
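To confirm these defaults took effect, you can list the properties set in your active gcloud configuration (a quick check; output varies with your settings):

gcloud config list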

Cloud Run locations

Cloud Run is regional, which means the infrastructure that runs your Cloud Run services is located in a specific region and is managed by Google to be redundantly available across all the zones within that region.

Your latency, availability, and durability requirements are the primary factors for selecting the region where your Cloud Run services run. You can generally select the region nearest to your users, but you should also consider the location of the other Google Cloud products that your Cloud Run service uses. Using Google Cloud products together across multiple locations can affect your service's latency as well as cost.

Cloud Run is available in the following regions:

Subject to Tier 1 pricing

  • asia-east1 (Taiwan)
  • asia-northeast1 (Tokyo)
  • asia-northeast2 (Osaka)
  • europe-north1 (Finland)
  • europe-west1 (Belgium)
  • europe-west4 (Netherlands)
  • us-central1 (Iowa)
  • us-east1 (South Carolina)
  • us-east4 (Northern Virginia)
  • us-west1 (Oregon)

Subject to Tier 2 pricing

  • asia-east2 (Hong Kong)
  • asia-northeast3 (Seoul, South Korea)
  • asia-southeast1 (Singapore)
  • asia-southeast2 (Jakarta)
  • asia-south1 (Mumbai, India)
  • australia-southeast1 (Sydney)
  • europe-west2 (London, UK)
  • europe-west3 (Frankfurt, Germany)
  • europe-west6 (Zurich, Switzerland)
  • northamerica-northeast1 (Montreal)
  • southamerica-east1 (Sao Paulo, Brazil)

Note that it is not possible to use the domain mapping feature of Cloud Run (fully managed) for services in these regions:

  • asia-east2
  • asia-northeast2
  • asia-northeast3
  • asia-southeast1
  • asia-southeast2
  • asia-south1
  • australia-southeast1
  • europe-west2
  • europe-west3
  • europe-west6
  • northamerica-northeast1
  • southamerica-east1

You can use Cloud Load Balancing with a serverless NEG to map a custom domain to Cloud Run (fully managed) services in these regions.

If you already created a Cloud Run service, you can view the region in the Cloud Run dashboard in the Cloud Console.
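If you prefer the command line, you can also list deployed services along with their regions. A sketch for Cloud Run (fully managed):

gcloud run services list --platform managed

The output includes each service's region alongside its name and URL.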

Assembling the code

Build the new Cloud Run greeter service step by step. As a reminder, this service intentionally produces a runtime error for the troubleshooting exercise.

  1. Create a new project:

    Node.js

    Create a Node.js project by defining the service package, initial dependencies, and some common operations.

    1. Create a new hello-service directory:

      mkdir hello-service
      cd hello-service
      
    2. Create a new Node.js project by generating a package.json file:

      npm init --yes
      npm install --save express@4
      
    3. Open the new package.json file in your editor and configure a start script to run node index.js. When you're done, the file will look like this:

      {
        "name": "hello-service",
        "version": "1.0.0",
        "description": "",
        "main": "index.js",
        "scripts": {
            "start": "node index.js",
            "test": "echo \"Error: no test specified\" && exit 1"
        },
        "keywords": [],
        "author": "",
        "license": "ISC",
        "dependencies": {
            "express": "^4.17.1"
        }
      }

    If you continue to evolve this service beyond this tutorial, consider filling in the description and author fields and evaluating the license. For more details, read the package.json documentation.

    Python

    1. Create a new hello-service directory:

      mkdir hello-service
      cd hello-service
      
    2. Create a requirements.txt file and copy your dependencies into it:

      Flask==1.1.2
      pytest==5.3.0; python_version > "3.0"
      pytest==4.6.6; python_version < "3.0"
      gunicorn==20.0.4
      

    Go

    1. Create a new hello-service directory:

      mkdir hello-service
      cd hello-service
      
    2. Create a Go project by initializing a new go module:

      go mod init example.com/hello-service
      

    You can use a different module name if you wish; you should change it if the code will be published to a web-reachable code repository.

    Java

    1. Create a new maven project:

      mvn archetype:generate \
        -DgroupId=com.example.cloudrun \
        -DartifactId=hello-service \
        -DarchetypeArtifactId=maven-archetype-quickstart \
        -DinteractiveMode=false
      
    2. Copy the dependencies into your pom.xml dependency list (between the <dependencies> and </dependencies> tags):

      <dependency>
        <groupId>com.sparkjava</groupId>
        <artifactId>spark-core</artifactId>
        <version>2.9.3</version>
      </dependency>
      <dependency>
        <groupId>ch.qos.logback</groupId>
        <artifactId>logback-classic</artifactId>
        <version>1.3.0-alpha5</version>
      </dependency>
      <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.30</version>
      </dependency>
      <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>1.7.30</version>
      </dependency>
      
    3. Copy the build settings into your pom.xml (after the closing </dependencies> tag):

      <build>
        <plugins>
          <plugin>
            <groupId>com.google.cloud.tools</groupId>
            <artifactId>jib-maven-plugin</artifactId>
            <version>2.6.0</version>
            <configuration>
              <to>
                <image>gcr.io/PROJECT_ID/hello-service</image>
              </to>
            </configuration>
          </plugin>
        </plugins>
      </build>
      

  2. Create an HTTP service to handle incoming requests:

    Node.js

    const express = require('express');
    const app = express();
    
    app.get('/', (req, res) => {
      console.log('hello: received request.');
    
      const {NAME} = process.env;
      if (!NAME) {
        // Plain error logs do not appear in Stackdriver Error Reporting.
        console.error('Environment validation failed.');
        console.error(new Error('Missing required server parameter'));
        return res.status(500).send('Internal Server Error');
      }
      res.send(`Hello ${NAME}!`);
    });
    const port = process.env.PORT || 8080;
    app.listen(port, () => {
      console.log(`hello: listening on port ${port}`);
    });

    Python

    import json
    import os
    
    from flask import Flask
    
    
    app = Flask(__name__)
    
    
    @app.route("/", methods=["GET"])
    def index():
        print("hello: received request.")
    
        NAME = os.getenv("NAME")
    
        if not NAME:
            print("Environment validation failed.")
            raise Exception("Missing required service parameter.")
    
        return f"Hello {NAME}"


    if __name__ == "__main__":
        PORT = int(os.getenv("PORT")) if os.getenv("PORT") else 8080
    
        # This is used when running locally. Gunicorn is used to run the
        # application on Cloud Run. See entrypoint in Dockerfile.
        app.run(host="127.0.0.1", port=PORT, debug=True)

    Go

    
    // Sample hello demonstrates a difficult to troubleshoot service.
    package main
    
    import (
    	"fmt"
    	"log"
    	"net/http"
    	"os"
    )
    
    func main() {
    	log.Print("hello: service started")
    
    	http.HandleFunc("/", helloHandler)
    
    
    	port := os.Getenv("PORT")
    	if port == "" {
    		port = "8080"
    		log.Printf("Defaulting to port %s", port)
    	}
    
    	log.Printf("Listening on port %s", port)
    	log.Fatal(http.ListenAndServe(fmt.Sprintf(":%s", port), nil))
    }
    
    func helloHandler(w http.ResponseWriter, r *http.Request) {
    	log.Print("hello: received request")
    
    	name := os.Getenv("NAME")
    	if name == "" {
    		log.Printf("Missing required server parameter")
    		// The panic stack trace appears in Cloud Error Reporting.
    		panic("Missing required server parameter")
    	}
    
    	fmt.Fprintf(w, "Hello %s!\n", name)
    }
    

    Java

    import static spark.Spark.get;
    import static spark.Spark.port;
    
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    
    public class App {
    
      private static final Logger logger = LoggerFactory.getLogger(App.class);
    
      public static void main(String[] args) {
        int port = Integer.parseInt(System.getenv().getOrDefault("PORT", "8080"));
        port(port);
    
        get(
            "/",
            (req, res) -> {
              logger.info("Hello: received request.");
              String name = System.getenv("NAME");
              if (name == null) {
                // Standard error logs do not appear in Stackdriver Error Reporting.
                System.err.println("Environment validation failed.");
                String msg = "Missing required server parameter";
                logger.error(msg, new Exception(msg));
                res.status(500);
                return "Internal Server Error";
              }
              res.status(200);
              return String.format("Hello %s!", name);
            });
      }
    }

  3. Create a Dockerfile to define the container image used to deploy the service:

    Node.js

    
    # Use the official lightweight Node.js 12 image.
    # https://hub.docker.com/_/node
    FROM node:12-slim
    
    # Create and change to the app directory.
    WORKDIR /usr/src/app
    
    # Copy application dependency manifests to the container image.
    # A wildcard is used to ensure copying both package.json AND package-lock.json (when available).
    # Copying this first prevents re-running npm install on every code change.
    COPY package*.json ./
    
    # Install production dependencies.
    # If you add a package-lock.json, speed your build by switching to 'npm ci'.
    # RUN npm ci --only=production
    RUN npm install --only=production
    
    # Copy local code to the container image.
    COPY . ./
    
    # Run the web service on container startup.
    CMD [ "npm", "start" ]
    

    Python

    
    # Use the official Python image.
    # https://hub.docker.com/_/python
    FROM python:3.9
    
    # Allow statements and log messages to immediately appear in the Cloud Run logs
    ENV PYTHONUNBUFFERED True
    
    # Copy application dependency manifests to the container image.
    # Copying this separately prevents re-running pip install on every code change.
    COPY requirements.txt ./
    
    # Install production dependencies.
    RUN pip install -r requirements.txt
    
    # Copy local code to the container image.
    ENV APP_HOME /app
    WORKDIR $APP_HOME
    COPY . ./
    
    # Run the web service on container startup. 
    # Use gunicorn webserver with one worker process and 8 threads.
    # For environments with multiple CPU cores, increase the number of workers
    # to be equal to the cores available.
    CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
    

    Go

    
    # Use the official golang image to create a binary.
    # This is based on Debian and sets the GOPATH to /go.
    # https://hub.docker.com/_/golang
    FROM golang:1.15-buster as builder
    
    # Create and change to the app directory.
    WORKDIR /app
    
    # Retrieve application dependencies.
    # This allows the container build to reuse cached dependencies.
    # Expecting to copy go.mod and if present go.sum.
    COPY go.* ./
    RUN go mod download
    
    # Copy local code to the container image.
    COPY . ./
    
    # Build the binary.
    RUN go build -mod=readonly -v -o server
    
    # Use the official Debian slim image for a lean production container.
    # https://hub.docker.com/_/debian
    # https://docs.docker.com/develop/develop-images/multistage-build/#use-multi-stage-builds
    FROM debian:buster-slim
    RUN set -x && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
        ca-certificates && \
        rm -rf /var/lib/apt/lists/*
    
    # Copy the binary to the production image from the builder stage.
    COPY --from=builder /app/server /server
    
    # Run the web service on container startup.
    CMD ["/server"]
    

    Java

    This sample uses Jib to build Docker images using common Java tools. Jib optimizes container builds without the need for a Dockerfile or having Docker installed. Learn more about building Java containers with Jib.

    <plugin>
      <groupId>com.google.cloud.tools</groupId>
      <artifactId>jib-maven-plugin</artifactId>
      <version>2.6.0</version>
      <configuration>
        <to>
          <image>gcr.io/PROJECT_ID/hello-service</image>
        </to>
      </configuration>
    </plugin>
    

Shipping the code

Shipping code consists of three steps: building a container image with Cloud Build, uploading the container image to Container Registry, and deploying the container image to Cloud Run.

To ship your code:

  1. Build your container and publish on Container Registry:

    Node.js

    gcloud builds submit --tag gcr.io/PROJECT_ID/hello-service

    Where PROJECT_ID is your GCP project ID. You can check your current project ID with gcloud config get-value project.

    Upon success, you should see a SUCCESS message containing the ID, creation time, and image name. The image is stored in Container Registry and can be re-used if desired.

    Python

    gcloud builds submit --tag gcr.io/PROJECT_ID/hello-service

    Where PROJECT_ID is your GCP project ID. You can check your current project ID with gcloud config get-value project.

    Upon success, you should see a SUCCESS message containing the ID, creation time, and image name. The image is stored in Container Registry and can be re-used if desired.

    Go

    gcloud builds submit --tag gcr.io/PROJECT_ID/hello-service

    Where PROJECT_ID is your GCP project ID. You can check your current project ID with gcloud config get-value project.

    Upon success, you should see a SUCCESS message containing the ID, creation time, and image name. The image is stored in Container Registry and can be re-used if desired.

    Java

    mvn compile jib:build -Dimage=gcr.io/PROJECT_ID/hello-service

    Where PROJECT_ID is your GCP project ID. You can check your current project ID with gcloud config get-value project.

    Upon success, you should see a BUILD SUCCESS message. The image is stored in Container Registry and can be re-used if desired.

  2. Run the following command to deploy your app:

    gcloud run deploy hello-service --image gcr.io/PROJECT_ID/hello-service

    Replace PROJECT_ID with your GCP project ID. hello-service is both the container image name and the name of the Cloud Run service. Notice that the container image is deployed to the service and region (Cloud Run) or cluster (Cloud Run for Anthos on Google Cloud) that you configured previously under Setting up gcloud defaults.

    If deploying to Cloud Run (fully managed), respond y ("Yes") to the allow unauthenticated prompt. See Managing Access for more details on IAM-based authentication.

    Wait until the deployment is complete: this can take about half a minute. On success, the command line displays the service URL.
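If you need the service URL again later, you can retrieve it from the service description rather than redeploying. A minimal sketch, assuming the hello-service name used above:

gcloud run services describe hello-service --format="value(status.url)"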

Trying it out

Try out the service to confirm you have successfully deployed it. Requests should fail with an HTTP 500 or HTTP 503 error (both members of the 5xx class of server errors). The tutorial walks through troubleshooting this error response. A curl-based status check is sketched at the end of this section.

  • For Cloud Run (fully managed), the service is auto-assigned a navigable URL.

    Navigate to this URL with your web browser:

    1. Open a web browser

    2. Find the service URL output by the earlier deploy command.

      If the deploy command did not provide a URL then something went wrong. Review the error message and act accordingly: if no actionable guidance is present, review the troubleshooting guide and possibly retry the deployment command.

    3. Navigate to this URL by copying it into your browser's address bar and pressing ENTER.

    4. See the HTTP 500 or HTTP 503 error.

    If you receive an HTTP 403 error, you may have declined the allow unauthenticated invocations prompt during deployment. Grant unauthenticated access to the service to fix this:

    gcloud run services add-iam-policy-binding hello-service \
      --member="allUsers" \
      --role="roles/run.invoker"
    

    For more information, read Allowing public (unauthenticated) access.

  • For Cloud Run for Anthos on Google Cloud, if you don't use automatic TLS certificates and domain mapping, you are not provided a navigable URL for your service.

    Instead, use the provided URL and the IP address of the service's ingress gateway to create a curl command that can make requests to your service:

    1. To get the external IP for the Istio ingress gateway:
      kubectl get svc ISTIO-GATEWAY -n NAMESPACE 
      Replace ISTIO-GATEWAY and NAMESPACE as follows:
      Cluster version                ISTIO-GATEWAY          NAMESPACE
      1.15.3-gke.19 and greater      istio-ingress          gke-system
      1.14.3-gke.12 and greater
      1.13.10-gke.8 and greater
      All other versions             istio-ingressgateway   istio-system
      where the resulting output looks something like this:
      NAME            TYPE           CLUSTER-IP     EXTERNAL-IP  PORT(S)
      ISTIO-GATEWAY    LoadBalancer   XX.XX.XXX.XX   pending     80:32380/TCP,443:32390/TCP,32400:32400/TCP
      
      The EXTERNAL-IP for the Load Balancer is the IP address you must use.
    2. Run a curl command using this EXTERNAL-IP address in the URL.

      curl -G -H "Host: SERVICE-DOMAIN" https://EXTERNAL-IP/

      Replace SERVICE-DOMAIN with the default assigned domain of your service. You can obtain this by taking the default URL and removing the protocol http://.

    3. See the HTTP 500 or HTTP 503 error.

    If your cluster is configured with a routable default domain, skip the steps above and instead copy the URL into your web browser.
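Whichever platform you use, you can also confirm the failing status code from the command line. A minimal curl sketch, where SERVICE_URL stands for the URL reported by the deploy command (for Cloud Run for Anthos, use the Host-header form shown above instead):

# Print the response headers along with the body.
curl -i SERVICE_URL
# Expect a status line such as HTTP/1.1 500 or HTTP/2 500.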

Investigating the problem

Imagine that the HTTP 5xx error encountered above in Trying it out occurred as a production runtime error. This tutorial walks through a formal process for handling it. Although production error resolution processes vary widely, this tutorial presents one particular sequence of steps to show the application of useful tools and techniques.

To investigate this problem you will work through these phases:

  • Collect more details on the reported error to support further investigation and to set a mitigation strategy.
  • Relieve user impact by deciding whether to push forward with a fix or roll back to a known-healthy version.
  • Reproduce the error to confirm the correct details have been gathered and that the error is not a one-time glitch.
  • Perform a root cause analysis on the bug to find the code, configuration, or process that created the error.

At the start of the investigation you have a URL, timestamp, and the message "Internal Server Error".

Gathering further details

Gather more information about the problem to understand what happened and determine next steps.

Use available Stackdriver tools to collect more details:

  1. Use the Error Reporting console, which provides a dashboard with details and recurrence tracking for errors with a recognized stack trace.

    Go to Error Reporting console

    Screenshot of the error list including columns 'Resolution Status', 'Occurrences', 'Error', and 'Seen in'.
    List of recorded errors. Errors are grouped by message across revisions, services, and platforms.
  2. Click on the error to see the stack trace details, noting the function calls made just prior to the error.

    Screenshot of a single parsed stack trace, demonstrating a common profile of this error.
    The "Stack trace sample" in the error details page shows a single instance of the error. You can review each individual instances.
  3. Use Cloud Logging to review the sequence of operations leading to the problem, including error messages that are not included in the Error Reporting console because of a lack of a recognized error stack trace:

    Go to Cloud Logging console

    • If using Cloud Run (fully managed), select Cloud Run Revision > hello-service from the first drop-down box. This will filter the log entries to those generated by your service.

    • If using Cloud Run for Anthos on Google Cloud, select Kubernetes Container > hello-service from the first drop-down box.

    Read more about viewing logs in Cloud Run. A command-line alternative is sketched below.
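The sketch below shows one way to pull recent log entries for the service with the gcloud CLI (filter fields per the Cloud Logging query language; for Cloud Run for Anthos, adjust the resource type to the Kubernetes container resource):

gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="hello-service"' \
  --limit=20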

Rollback to a healthy version

If this were an established service known to work, there would be a previous revision of the service on Cloud Run to roll back to. This tutorial uses a new service with no previous revisions, so you cannot roll back.

However, if you have a service with previous versions you can roll back to, follow Viewing revision details to extract the container name and configuration details necessary to create a new working deployment of your service.
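For reference, a rollback on Cloud Run amounts to shifting traffic back to an earlier revision. A sketch for Cloud Run (fully managed), assuming a hypothetical known-good revision named hello-service-00001-abc:

# List revisions to identify a known-good one.
gcloud run revisions list --service hello-service

# Route all traffic to that revision (hypothetical revision name).
gcloud run services update-traffic hello-service \
  --to-revisions hello-service-00001-abc=100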

Reproducing the error

Using the details you obtained previously, confirm the problem consistently occurs under test conditions.

Send the same HTTP request by trying it out again, and see if the same error and details are reported. It may take some time for error details to show up.

Because the sample service in this tutorial is read-only and doesn't trigger any complicating side effects, reproducing errors in production is safe. However, for many real services, this won't be the case: you may need to reproduce errors in a test environment or limit this step to local investigation.

Reproducing the error establishes the context for further work. For example, if developers cannot reproduce the error, further investigation may require additional instrumentation of the service.

Performing a root cause analysis

Root cause analysis is an important step in effective troubleshooting to ensure you fix the problem instead of a symptom.

Previously in this tutorial, you reproduced the problem on Cloud Run, which confirms that the problem is active when the service is hosted on Cloud Run. Now reproduce the problem locally to determine whether it is isolated to the code or whether it emerges only in production hosting.

  1. If you have not used Docker CLI locally with Container Registry, authenticate it with gcloud:

    gcloud auth configure-docker

    For alternative approaches see Container Registry authentication methods.

  2. If you don't have the most recently used container image name at hand, the service description includes the most recently deployed container image:

    gcloud run services describe hello-service

    Find the container image name inside the spec object. A more targeted command can retrieve it directly:

    gcloud run services describe hello-service \
       --format="value(spec.template.spec.containers.image)"

    This command reveals a container image name such as gcr.io/PROJECT_ID/hello-service.

  3. Pull the container image from Container Registry to your environment. This step might take several minutes as it downloads the container image:

    docker pull gcr.io/PROJECT_ID/hello-service

    Later updates to the container image that reuse this name can be retrieved with the same command. If you skip this step, the docker run command below pulls a container image if one is not present on the local machine.

  4. Run locally to confirm the problem is not unique to Cloud Run:

    PORT=8080 && docker run --rm -e PORT=$PORT -p 9000:$PORT \
       gcr.io/PROJECT_ID/hello-service

    Breaking down the elements of the command above:

    • The PORT environment variable is used by the service to determine the port to listen on inside the container.
    • The run command starts the container, defaulting to the entrypoint command defined in the Dockerfile or a parent container image.
    • The --rm flag deletes the container instance on exit.
    • The -e flag assigns a value to an environment variable; -e PORT=$PORT propagates the PORT variable from the local system into the container under the same variable name.
    • The -p flag publishes the container as a service available on localhost at port 9000. Requests to localhost:9000 will be routed to the container on port 8080. This means output from the service about the port number in use will not match how the service is accessed.
    • The final argument gcr.io/PROJECT_ID/hello-service is a container image tag, a human-readable label for a container image's sha256 hash identifier. If the image is not available locally, docker attempts to retrieve it from a remote registry.

    In your browser, open http://localhost:9000. Check the terminal output for error messages that match those on Google Cloud's operations suite.

    If the problem is not reproducible locally, it may be unique to the Cloud Run environment. Review the Cloud Run troubleshooting guide for specific areas to investigate.

    In this case the error is reproduced locally.

Now that the error is doubly confirmed as persistent and caused by the service code rather than the hosting platform, it's time to investigate the code more closely.

For the purposes of this tutorial, it is safe to assume that the code inside the container and the code on the local system are identical.

Revisit the error report's stack trace and cross-reference with the code to find the specific lines at fault.

Node.js

Find the source of the error message in the file index.js around the line number called out in the stack trace shown in the logs:
const {NAME} = process.env;
if (!NAME) {
  // Plain error logs do not appear in Stackdriver Error Reporting.
  console.error('Environment validation failed.');
  console.error(new Error('Missing required server parameter'));
  return res.status(500).send('Internal Server Error');
}

Python

Find the source of the error message in the file main.py around the line number called out in the stack trace shown in the logs:
NAME = os.getenv("NAME")

if not NAME:
    print("Environment validation failed.")
    raise Exception("Missing required service parameter.")

Go

Find the source of the error message in the file main.go around the line number called out in the stack trace shown in the logs:

name := os.Getenv("NAME")
if name == "" {
	log.Printf("Missing required server parameter")
	// The panic stack trace appears in Cloud Error Reporting.
	panic("Missing required server parameter")
}

Java

Find the source of the error message in the file App.java around the line number called out in the stack trace shown in the logs:

String name = System.getenv("NAME");
if (name == null) {
  // Standard error logs do not appear in Stackdriver Error Reporting.
  System.err.println("Environment validation failed.");
  String msg = "Missing required server parameter";
  logger.error(msg, new Exception(msg));
  res.status(500);
  return "Internal Server Error";
}

Examining this code, the following actions are taken when the NAME environment variable is not set:

  • An error is logged to Google Cloud's operations suite
  • An HTTP error response is sent

The problem is caused by a missing variable, but the root cause is more specific: the code change that added the hard dependency on an environment variable did not include related changes to deployment scripts and documentation of runtime requirements.

Fixing the root cause

Now that we have examined the code and identified the potential root cause, we can take steps to fix it.

  • Check whether the service works locally with the NAME environment variable in place:

    1. Run the container locally with the environment variable added:

      PORT=8080 && docker run --rm -e PORT=$PORT -p 9000:$PORT \
       -e NAME="Local World!" \
       gcr.io/PROJECT_ID/hello-service
    2. Navigate your browser to http://localhost:9000

    3. See "Hello Local World!" appear on the page

  • Modify the running Cloud Run service environment to include this variable:

    1. Run the services update command to add an environment variable:

      gcloud run services update hello-service \
        --set-env-vars NAME=Override
      
    2. Wait a few seconds while Cloud Run creates a new revision based on the previous revision with the new environment variable added.

  • Confirm the service is now fixed:

    1. Navigate your browser to the Cloud Run service URL.
    2. See "Hello Override!" appear on the page.
    3. Verify that no unexpected messages or errors appear in Cloud Logging or Error Reporting.
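You can also verify the fix from the command line: describe the service and confirm that the NAME variable appears in the env section of the latest revision template (a quick check, not a substitute for the steps above):

gcloud run services describe hello-service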

Improving future troubleshooting speed

In this sample production problem, the error was related to operational configuration. There are code changes that can minimize the impact of similar problems in the future.

  • Improve the error log to include more specific details.
  • Instead of returning an error, have the service fall back to a safe default. If using a default represents a change to normal functionality, use a warning message for monitoring purposes.

Let's step through removing the NAME environment variable as a hard dependency.

  1. Remove the existing NAME-handling code:

    Node.js

    const {NAME} = process.env;
    if (!NAME) {
      // Plain error logs do not appear in Stackdriver Error Reporting.
      console.error('Environment validation failed.');
      console.error(new Error('Missing required server parameter'));
      return res.status(500).send('Internal Server Error');
    }

    Python

    NAME = os.getenv("NAME")
    
    if not NAME:
        print("Environment validation failed.")
        raise Exception("Missing required service parameter.")

    Go

    name := os.Getenv("NAME")
    if name == "" {
    	log.Printf("Missing required server parameter")
    	// The panic stack trace appears in Cloud Error Reporting.
    	panic("Missing required server parameter")
    }

    Java

    String name = System.getenv("NAME");
    if (name == null) {
      // Standard error logs do not appear in Stackdriver Error Reporting.
      System.err.println("Environment validation failed.");
      String msg = "Missing required server parameter";
      logger.error(msg, new Exception(msg));
      res.status(500);
      return "Internal Server Error";
    }

  2. Add new code that sets a fallback value:

    Node.js

    const NAME = process.env.NAME || 'World';
    if (!process.env.NAME) {
      console.log(
        JSON.stringify({
          severity: 'WARNING',
          message: `NAME not set, default to '${NAME}'`,
        })
      );
    }

    Python

    NAME = os.getenv("NAME")
    
    if not NAME:
        NAME = "World"
        error_message = {
            "severity": "WARNING",
            "message": f"NAME not set, default to {NAME}",
        }
        print(json.dumps(error_message))

    Go

    name := os.Getenv("NAME")
    if name == "" {
    	name = "World"
    	log.Printf("warning: NAME not set, default to %s", name)
    }

    Java

    String name = System.getenv().getOrDefault("NAME", "World");
    if (System.getenv("NAME") == null) {
      logger.warn(String.format("NAME not set, default to %s", name));
    }

  3. Test locally by re-building and running the container through the affected configuration cases:

    Node.js

    docker build --tag gcr.io/PROJECT_ID/hello-service .

    Python

    docker build --tag gcr.io/PROJECT_ID/hello-service .

    Go

    docker build --tag gcr.io/PROJECT_ID/hello-service .

    Java

    mvn compile jib:build

    Confirm the NAME environment variable still works:

    PORT=8080 && docker run --rm -e PORT=$PORT -p 9000:$PORT \
     -e NAME="Robust World" \
     gcr.io/PROJECT_ID/hello-service

    Confirm the service works without the NAME variable:

    PORT=8080 && docker run --rm -e PORT=$PORT -p 9000:$PORT \
     gcr.io/PROJECT_ID/hello-service

    If the service does not return a result, confirm the removal of code in the first step did not remove extra lines, such as those used to write the response.

  4. Deploy this by revisiting the Deploy your code section.

    Each deployment to a service creates a new revision and automatically starts serving traffic when ready.

    To clear the environment variables set earlier:

    gcloud run services update hello-service --clear-env-vars

Add the new functionality for the default value to automated test coverage for the service.
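As a starting point, here is a hedged shell sketch of such a test: it runs the rebuilt image without NAME and asserts that the default greeting is served. The container name hello-smoke and the sleep duration are arbitrary choices for illustration.

# Start the rebuilt image without NAME set.
docker run --rm -d -e PORT=8080 -p 9000:8080 --name hello-smoke \
  gcr.io/PROJECT_ID/hello-service
sleep 3  # give the server a moment to start listening

RESPONSE=$(curl -s http://localhost:9000/)
docker stop hello-smoke >/dev/null

# The exact greeting differs slightly across the language samples,
# so match the shared core text.
if echo "$RESPONSE" | grep -q "Hello World"; then
  echo "PASS: service fell back to the default name"
else
  echo "FAIL: unexpected response: $RESPONSE" >&2
  exit 1
fi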

Finding other issues in the logs

You may see other issues in the Log Viewer for this service. For example, an unsupported system call will appear in the logs as a "Container Sandbox Limitation".

For example, Node.js services sometimes produce this log message:

Container Sandbox Limitation: Unsupported syscall statx(0xffffff9c,0x3e1ba8e86d88,0x0,0xfff,0x3e1ba8e86970,0x3e1ba8e86a90). Please, refer to https://gvisor.dev/c/linux/amd64/statx for more information.

In this case, the lack of support does not impact the hello-service sample service.
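To check whether such messages occur for your own service, a hedged filter sketch using the gcloud CLI (substring match on the log text; the resource type shown is for Cloud Run fully managed):

gcloud logging read \
  'resource.type="cloud_run_revision" AND textPayload:"Container Sandbox"' \
  --limit=10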

Cleaning up

If you created a new project for this tutorial, delete the project. If you used an existing project and wish to keep it without the changes added in this tutorial, delete resources created for the tutorial.

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

Deleting tutorial resources

  1. Delete the Cloud Run service you deployed in this tutorial:

    gcloud run services delete SERVICE-NAME

    Where SERVICE-NAME is your chosen service name.

    You can also delete Cloud Run services from the Google Cloud Console.

  2. Remove the gcloud default configurations you added during tutorial setup.

    If you use Cloud Run (fully managed), remove the region setting:

     gcloud config unset run/region
    

    If you use Cloud Run for Anthos on Google Cloud, remove the cluster configuration:

     gcloud config unset run/cluster
     gcloud config unset run/cluster_location
    
  3. Remove the project configuration:

     gcloud config unset project
    
  4. Delete other Google Cloud resources created in this tutorial, such as the hello-service container image stored in Container Registry.

What's next