> **Note:** The following considerations apply to this GA offering:
>
> - Jobs that use GPUs incur charges as specified in the Dataflow [pricing page](/dataflow/pricing).
> - To use GPUs, your Dataflow job must use [Dataflow Runner v2](/dataflow/docs/runner-v2).

If you encounter problems running your Dataflow job with GPUs, follow these steps:

1. Follow the workflow in [Best practices for working with Dataflow GPUs](/dataflow/docs/gpu/develop-with-gpus) to ensure that your pipeline is configured correctly.
2. Confirm that your Dataflow job is using GPUs. See [Verify your Dataflow job](/dataflow/docs/gpu/use-gpus#verify) in "Run a pipeline with GPUs."
3. [Debug your job](#debug-job), either with a standalone VM or by using Dataflow.
4. If the problem persists, follow the rest of the troubleshooting steps on this page.
## Debug your job

If possible, [debug your job with a standalone VM](#debug-vm), because debugging with a standalone VM is usually faster. However, if organizational policies prevent you from debugging with a standalone VM, you can [debug by using Dataflow](#debug-df).
## Debug with a standalone VM

While you're designing and iterating on a container image that works for you, it can be faster to reduce the feedback loop by trying out your container image on a standalone VM.

You can debug your custom container on a standalone VM with GPUs by creating a Compute Engine VM running GPUs on Container-Optimized OS, installing drivers, and starting your container as follows.

1. Create a VM instance.

        gcloud compute instances create INSTANCE_NAME \
            --project "PROJECT" \
            --image-family cos-stable \
            --image-project=cos-cloud \
            --zone=us-central1-f \
            --accelerator type=nvidia-tesla-t4,count=1 \
            --maintenance-policy TERMINATE \
            --restart-on-failure \
            --boot-disk-size=200G \
            --scopes=cloud-platform

2. Use `ssh` to connect to the VM.

        gcloud compute ssh INSTANCE_NAME --project "PROJECT"

3. Install the GPU drivers. After connecting to the VM by using `ssh`, run the following commands on the VM:

        # Run these commands on the virtual machine
        cos-extensions install gpu
        sudo mount --bind /var/lib/nvidia /var/lib/nvidia
        sudo mount -o remount,exec /var/lib/nvidia
        /var/lib/nvidia/bin/nvidia-smi

4. Launch your custom container.

    Apache Beam SDK containers use the `/opt/apache/beam/boot` entrypoint. For debugging purposes, you can launch your container manually with a different entrypoint:

        docker-credential-gcr configure-docker
        docker run --rm \
            -it \
            --entrypoint=/bin/bash \
            --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
            --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
            --privileged \
            IMAGE

    Replace IMAGE with the Artifact Registry path for your Docker image.

5. Verify that the GPU libraries installed in your container can access the GPU devices.

    If you're using TensorFlow, you can print the available devices in the Python interpreter with the following:

        >>> import tensorflow as tf
        >>> print(tf.config.list_physical_devices("GPU"))

    If you're using PyTorch, you can inspect the available devices in the Python interpreter with the following:

        >>> import torch
        >>> print(torch.cuda.is_available())
        >>> print(torch.cuda.device_count())
        >>> print(torch.cuda.get_device_name(0))

To iterate on your pipeline, you can launch your pipeline on the Direct Runner, as in the sketch that follows. You can also launch pipelines on Dataflow Runner from this environment.
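For example, a minimal way to exercise the Direct Runner from inside the container is a script like the following. This is a sketch, not part of the documented setup: `describe_gpus` is an illustrative helper name, and it assumes PyTorch is installed in your container image.

    # A minimal sketch that runs a trivial pipeline on the Direct Runner
    # to confirm the Beam SDK and GPU libraries load together.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def describe_gpus(element):
      import torch  # Assumes PyTorch is installed in the image.
      return (element, torch.cuda.is_available())

    if __name__ == "__main__":
      options = PipelineOptions(["--runner=DirectRunner"])
      with beam.Pipeline(options=options) as p:
        _ = (p
             | beam.Create([1])
             | beam.Map(describe_gpus)
             | beam.Map(print))  # Expect (1, True) when a GPU is visible.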
## Debug by using Dataflow

If organizational constraints prevent you from debugging on a standalone VM, you can debug by using Dataflow.

Simplify your pipeline so that all it does is detect whether GPUs are present, and then run the pipeline on Dataflow. The following example demonstrates what the code for this pipeline might look like:

    import apache_beam as beam

    def check_if_gpus_present(element):
      import torch
      import tensorflow as tf

      tensorflow_detects_gpus = tf.config.list_physical_devices("GPU")
      torch_detects_gpus = torch.cuda.is_available()
      if tensorflow_detects_gpus and torch_detects_gpus:
        return element

      if tensorflow_detects_gpus:
        raise Exception('PyTorch failed to detect GPUs with your setup')
      if torch_detects_gpus:
        raise Exception('TensorFlow failed to detect GPUs with your setup')
      raise Exception('Both TensorFlow and PyTorch failed to detect GPUs with your setup')

    with beam.Pipeline() as p:
      _ = (p | beam.Create([1, 2, 3])  # Create a small input PCollection.
             | beam.Map(check_if_gpus_present)
           )

If your pipeline succeeds, your code is able to access GPUs. To identify the problem code, gradually insert progressively larger examples into your pipeline code, running your pipeline after each change. The sketch after this section shows one way to take the first such step.

If your pipeline fails to detect GPUs, follow the steps in the [No GPU usage](#no-gpu-usage) section of this document.
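For example, a first increment beyond presence-checking might run a small computation on the GPU, which surfaces driver or library problems that a simple device query can miss. The following is a sketch of such a step, assuming PyTorch is installed in your container image; `run_small_gpu_op` is an illustrative name, not part of the Beam API.

    import apache_beam as beam

    def run_small_gpu_op(element):
      # Allocate a tensor on the GPU and run a small computation, so that a
      # CUDA kernel actually executes rather than just a device query.
      import torch
      device = torch.device("cuda")
      x = torch.ones((64, 64), device=device)
      checksum = (x @ x).sum().item()
      return element, checksum

    with beam.Pipeline() as p:
      _ = (p
           | beam.Create([1, 2, 3])
           | beam.Map(run_small_gpu_op))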
## Workers don't start

If your job is stuck and the Dataflow workers never start processing data, it's likely that you have a problem related to using a custom container with Dataflow. For more details, read the [custom containers troubleshooting guide](/dataflow/docs/guides/troubleshoot-custom-container).

If you're a Python user, verify that the following conditions are met:

- The Python interpreter minor version in your container image is the same version as you use when launching your pipeline; a sketch for checking this follows this list. If there's a mismatch, you might see errors like [`SystemError: unknown opcode`](/dataflow/docs/guides/common-errors#custom-container-python-version) with a stack trace involving `apache_beam/internal/pickler.py`.
- If you're using the Apache Beam SDK 2.29.0 or earlier, `pip` must be accessible on the image in `/usr/local/bin/pip`.
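As a quick check for the interpreter mismatch described in the first item, compare the version reported by your launch environment with the one inside your container image. This is a minimal sketch; run it in both places and compare the `major.minor` output.

    # Run once in the environment that launches the pipeline and once inside
    # the container image; the major.minor versions must match.
    import sys

    print("Python {}.{}".format(sys.version_info.major, sys.version_info.minor))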
We recommend that you reduce the customizations to a minimal working configuration the first time you use a custom image. Use the sample custom container images provided in the examples on this page. Make sure you can run a straightforward Dataflow pipeline with this container image without requesting GPUs. Then, iterate on the solution.

Verify that workers have sufficient disk space to download your container image. Adjust the disk size if necessary. Large images take longer to download, which increases worker startup time.

## Job fails immediately at startup

If you encounter the [`ZONE_RESOURCE_POOL_EXHAUSTED`](/compute/docs/troubleshooting/troubleshooting-vm-creation#resource_availability) or [`ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS`](/compute/docs/troubleshooting/troubleshooting-vm-creation#resource_availability) errors, you can take the following steps:

- Don't specify the worker zone, so that Dataflow selects the optimal zone for you.
- Launch the pipeline in a different zone or with a different accelerator type.
[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-08-18。"],[[["\u003cp\u003eDataflow jobs using GPUs will incur charges as specified on the Dataflow pricing page and must use Dataflow Runner v2.\u003c/p\u003e\n"],["\u003cp\u003eWhen encountering issues, it is important to follow best practices for working with Dataflow GPUs to ensure correct pipeline configuration and verify that the Dataflow job is actively utilizing GPUs.\u003c/p\u003e\n"],["\u003cp\u003eDebugging a job with a standalone VM is often faster, involving creating a Compute Engine VM with GPUs, installing drivers, and launching your custom container.\u003c/p\u003e\n"],["\u003cp\u003eIf debugging on a standalone VM is not possible, Dataflow can be used to debug by running a simplified pipeline to detect GPU presence, and it is important to incrementally add more code to isolate the problem.\u003c/p\u003e\n"],["\u003cp\u003eIf no GPUs are detected, verify the NVIDIA libraries, along with compatibility between TensorFlow, cuDNN, and CUDA, and check for known issues with the version of TensorFlow and CUDA being used.\u003c/p\u003e\n"]]],[],null,["\u003cbr /\u003e\n\n| **Note:** The following considerations apply to this GA offering:\n|\n| - Jobs that use GPUs incur charges as specified in the Dataflow [pricing page](/dataflow/pricing).\n| - To use GPUs, your Dataflow job must use [Dataflow Runner v2](/dataflow/docs/runner-v2).\n\n\u003cbr /\u003e\n\nIf you encounter problems running your Dataflow job with GPUs,\nfollow these steps:\n\n1. Follow the workflow in [Best practices for working with Dataflow GPUs](/dataflow/docs/gpu/develop-with-gpus) to ensure that your pipeline is configured correctly.\n2. Confirm that your Dataflow job is using GPUs. See [Verify your Dataflow job](/dataflow/docs/gpu/use-gpus#verify) in \"Run a pipeline with GPUs.\"\n3. [Debug your job](#debug-job), either with a standalone VM or by using Dataflow.\n4. If the problem persists, follow the rest of the troubleshooting steps on this page.\n\nDebug your job\n\nIf possible, [debug your job with a standalone VM](#debug-vm), because debugging\nwith a standalone VM is usually faster. However, if organizational policies\nprevent you from debugging with a standalone VM, you can\n[debug by using Dataflow](#debug-df).\n\nDebug with a standalone VM\n\nWhile you're designing and iterating on a container image that works for you,\nit can be faster to reduce the feedback loop by trying out your container image\non a standalone VM.\n\nYou can debug your custom container on a standalone VM with GPUs by creating a\nCompute Engine VM running GPUs on Container-Optimized OS,\ninstalling drivers, and starting your container as follows.\n\n1. Create a VM instance.\n\n gcloud compute instances create \u003cvar translate=\"no\"\u003eINSTANCE_NAME\u003c/var\u003e \\\n --project \"\u003cvar translate=\"no\"\u003ePROJECT\u003c/var\u003e\" \\\n --image-family cos-stable \\\n --image-project=cos-cloud \\\n --zone=us-central1-f \\\n --accelerator type=nvidia-tesla-t4,count=1 \\\n --maintenance-policy TERMINATE \\\n --restart-on-failure \\\n --boot-disk-size=200G \\\n --scopes=cloud-platform\n\n2. 
## No GPU usage

If your job doesn't appear to be using GPUs, follow the steps in the [Debug your job](#debug-job) section of this document to verify whether GPUs are available with your Docker image.

If GPUs are available but not used, the problem is likely with the pipeline code. To debug the pipeline code, start with a straightforward pipeline that successfully uses GPUs, and then gradually add code to the pipeline, testing the pipeline with each new addition. For more information, see the [Debug by using Dataflow](#debug-df) section of this document.

If your pipeline fails to detect GPUs, verify the following:

- The NVIDIA libraries installed in the container image match the requirements of the pipeline's user code and the libraries that it uses.
- The NVIDIA libraries installed in the container image are accessible as shared libraries; see the sketch after this section.

If the devices are not available, you might be using an incompatible software configuration. To verify the image configuration, run a straightforward pipeline that just checks that GPUs are available and accessible to the workers.
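One way to check that the NVIDIA libraries are visible as shared libraries is to try loading them directly. The following is a minimal sketch using Python's standard `ctypes` module; `libcuda.so.1` and `libcudart.so` are the usual names for the CUDA driver and runtime libraries, but the exact set your pipeline needs depends on your framework versions.

    import ctypes

    # Spot-check that the dynamic loader can find the CUDA libraries.
    # A failure raises OSError with the loader's error message.
    for lib in ("libcuda.so.1", "libcudart.so"):
      try:
        ctypes.CDLL(lib)
        print("Loaded", lib)
      except OSError as e:
        print("Could not load", lib, "-", e)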
## Troubleshoot TensorFlow issues

If PyTorch detects GPUs in your pipeline but TensorFlow doesn't, try the following troubleshooting steps:

- Verify that you have a compatible combination of TensorFlow, cuDNN version, and CUDA Toolkit version. For more information, see [Tested build configurations](https://www.tensorflow.org/install/source#gpu) in the TensorFlow documentation. One way to inspect the versions your TensorFlow build expects is shown in the sketch after this list.
- If possible, upgrade to the latest compatible TensorFlow and CUDA versions.
- Review the known issues for TensorFlow and CUDA to verify whether a known issue is causing problems in your pipeline. For example, the following known issue could prevent TensorFlow from detecting GPUs: [TF 2.17.0 RC0 Fails to work with GPUs](https://github.com/tensorflow/tensorflow/issues/63362).
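To see which CUDA and cuDNN versions your installed TensorFlow wheel was built against, you can query its build info and compare the result with what the container image provides. This is a minimal sketch; the `cuda_version` and `cudnn_version` keys are present in GPU-enabled TensorFlow 2.x builds.

    import tensorflow as tf

    # Print the CUDA/cuDNN versions this TensorFlow build expects, to compare
    # against the libraries installed in the container image.
    info = tf.sysconfig.get_build_info()
    print("TensorFlow:", tf.__version__)
    print("CUDA:", info.get("cuda_version"))
    print("cuDNN:", info.get("cudnn_version"))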
## What's next

- [Getting started: Running GPUs on Container-Optimized OS](/container-optimized-os/docs/how-to/run-gpus#getting_started_running_gpus_on)
- [Container-Optimized OS toolbox](/container-optimized-os/docs/how-to/toolbox)
- [Service account access scopes](/compute/docs/access/service-accounts#accesscopesiam)