Attaching GPUs to clusters

Dataproc lets you attach graphics processing units (GPUs) to the Compute Engine master and worker nodes in a Dataproc cluster. You can use these GPUs to accelerate specific workloads on your instances, such as machine learning and data processing.

For more information about the benefits of GPUs and the different types of GPU hardware, see GPUs on Compute Engine.

Before you begin

  • GPUs require special drivers and software. These items are not pre-installed on Dataproc clusters.
  • Read about GPU pricing on Compute Engine to understand the cost of using GPUs in your instances.
  • Read about restrictions for instances with GPUs to learn how these instances differ from non-GPU instances.
  • Check the Quotas page for your project to make sure that sufficient GPU quota (NVIDIA_K80_GPUS, NVIDIA_P100_GPUS, or NVIDIA_V100_GPUS) is available. If GPUs are not listed on the Quotas page, or if you need additional GPU quota, you can request a quota increase.
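
As a rough illustration of the quota check, the following sketch filters quota rows for GPU metrics and prints the unused amount for each. It assumes input rows in a `metric,limit,usage` CSV form. The helper name `gpu_quota_headroom` and the sample numbers are made up for illustration, and the `gcloud` invocation in the comment is only one possible way to produce such rows.

```shell
# Hypothetical helper: reads "metric,limit,usage" rows on stdin and prints
# the unused quota for each NVIDIA GPU metric.
gpu_quota_headroom() {
  awk -F, '$1 ~ /^NVIDIA_.*_GPUS$/ { printf "%s %d\n", $1, $2 - $3 }'
}

# Illustrative input only. Real rows could come from something like:
#   gcloud compute regions describe REGION --flatten="quotas[]" \
#     --format="csv[no-heading](quotas.metric,quotas.limit,quotas.usage)"
printf '%s\n' 'CPUS,24,8' 'NVIDIA_K80_GPUS,4,1' 'NVIDIA_V100_GPUS,0,0' |
  gpu_quota_headroom
```

A result of 0 for the GPU type you plan to use suggests that a quota increase is needed before the cluster can be created.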

GPU types

Dataproc nodes support the following GPU types. You must specify the GPU type when you attach GPUs to your Dataproc cluster.

  • nvidia-tesla-k80 – NVIDIA® Tesla® K80
  • nvidia-tesla-p100 – NVIDIA® Tesla® P100
  • nvidia-tesla-v100 – NVIDIA® Tesla® V100
  • nvidia-tesla-p4 – NVIDIA® Tesla® P4
  • nvidia-tesla-t4 – NVIDIA® Tesla® T4
  • nvidia-tesla-p100-vws – NVIDIA® Tesla® P100 Virtual Workstations
  • nvidia-tesla-p4-vws – NVIDIA® Tesla® P4 Virtual Workstations
  • nvidia-tesla-t4-vws – NVIDIA® Tesla® T4 Virtual Workstations
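
Because the GPU type must be given by its exact name, a wrapper script might check the name before it creates a cluster. The following is a minimal sketch; the helper name `is_supported_gpu_type` is hypothetical, and the accepted names are copied from the list above.

```shell
# Hypothetical helper: succeeds only if $1 is one of the GPU type names
# listed on this page.
is_supported_gpu_type() {
  case "$1" in
    nvidia-tesla-k80|nvidia-tesla-p100|nvidia-tesla-v100|\
    nvidia-tesla-p4|nvidia-tesla-t4|\
    nvidia-tesla-p100-vws|nvidia-tesla-p4-vws|nvidia-tesla-t4-vws)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

if is_supported_gpu_type 'nvidia-tesla-t4'; then
  echo 'nvidia-tesla-t4 is in the list'
fi
```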

Attaching GPUs to clusters

gcloud

Attach GPUs to the master and to the primary and secondary worker nodes in a Dataproc cluster when you create the cluster by using the --master-accelerator, --worker-accelerator, and --secondary-worker-accelerator flags. Each flag takes two values:

  1. the type of GPU to attach to a node
  2. the number of GPUs to attach to the node

The GPU type is required; the number of GPUs is optional and defaults to 1.

Example:

gcloud dataproc clusters create cluster-name \
    --region=region \
    --master-accelerator type=nvidia-tesla-k80 \
    --worker-accelerator type=nvidia-tesla-k80,count=4 \
    --secondary-worker-accelerator type=nvidia-tesla-k80,count=4 \
    ... other flags
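
To make the required type and the optional count concrete, the following sketch builds the value that follows an accelerator flag. The helper name `accelerator_value` is hypothetical; it only formats a string and does not call gcloud.

```shell
# Hypothetical helper: prints the value for an accelerator flag.
# $1 = GPU type (required), $2 = GPU count (optional, defaults to 1).
accelerator_value() {
  if [ -z "${1:-}" ]; then
    echo 'accelerator_value: GPU type is required' >&2
    return 1
  fi
  echo "type=$1,count=${2:-1}"
}

echo "--master-accelerator $(accelerator_value nvidia-tesla-k80)"
echo "--worker-accelerator $(accelerator_value nvidia-tesla-k80 4)"
```

The first line omits the count, so the helper writes out the default of 1 explicitly.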

To use GPUs in your cluster, you must install GPU drivers.

REST API

Attach GPUs to the master and to the primary and secondary (preemptible) worker nodes in a Dataproc cluster by filling in the InstanceGroupConfig.AcceleratorConfig acceleratorTypeUri and acceleratorCount fields as part of the clusters.create API request.
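
As a sketch of where these two fields sit in a cluster creation request body, the helper below prints a JSON fragment with an `accelerators` entry for each instance group. The helper name `gpu_cluster_request_body`, the cluster name, and the counts are placeholders. The surrounding field names (`masterConfig`, `workerConfig`, `secondaryWorkerConfig`, `accelerators`) reflect the Dataproc API as commonly documented and should be checked against the current API reference.

```shell
# Hypothetical helper: prints a request body that attaches GPUs to each
# instance group. $1 = cluster name, $2 = GPU type,
# $3 = GPUs per worker node (optional, defaults to 1).
gpu_cluster_request_body() {
  cat <<EOF
{
  "clusterName": "$1",
  "config": {
    "masterConfig": {
      "accelerators": [{"acceleratorTypeUri": "$2", "acceleratorCount": 1}]
    },
    "workerConfig": {
      "accelerators": [{"acceleratorTypeUri": "$2", "acceleratorCount": ${3:-1}}]
    },
    "secondaryWorkerConfig": {
      "accelerators": [{"acceleratorTypeUri": "$2", "acceleratorCount": ${3:-1}}]
    }
  }
}
EOF
}

gpu_cluster_request_body example-cluster nvidia-tesla-k80 4
```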

Console

In the Cloud Console, on the Create a cluster page, in the "Configure nodes" section, click CPU PLATFORM AND GPU→GPUs→ADD GPUs to specify the number of GPUs and the GPU type for the nodes.

Installing GPU drivers

GPU drivers are required to use any GPUs attached to Dataproc nodes. To install GPU drivers, follow the instructions for the initialization action listed below.

#!/bin/bash
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS-IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script installs NVIDIA GPU drivers and collects GPU utilization metrics.

set -euxo pipefail

function get_metadata_attribute() {
  local -r attribute_name=$1
  local -r default_value=$2
  /usr/share/google/get_metadata_value "attributes/${attribute_name}" || echo -n "${default_value}"
}

OS_NAME=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
readonly OS_NAME
OS_DIST=$(lsb_release -cs)
readonly OS_DIST

# Dataproc role
ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
readonly ROLE

# CUDA Version
CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '10.2')
readonly CUDA_VERSION

# Parameters for NVIDIA-provided Debian GPU driver
readonly DEFAULT_NVIDIA_DEBIAN_GPU_DRIVER_VERSION='460.56'
readonly DEFAULT_NVIDIA_DEBIAN_GPU_DRIVER_URL="https://us.download.nvidia.com/XFree86/Linux-x86_64/${DEFAULT_NVIDIA_DEBIAN_GPU_DRIVER_VERSION}/NVIDIA-Linux-x86_64-${DEFAULT_NVIDIA_DEBIAN_GPU_DRIVER_VERSION}.run"
NVIDIA_DEBIAN_GPU_DRIVER_URL=$(get_metadata_attribute 'gpu-driver-url' "${DEFAULT_NVIDIA_DEBIAN_GPU_DRIVER_URL}")
readonly NVIDIA_DEBIAN_GPU_DRIVER_URL

readonly NVIDIA_BASE_DL_URL='https://developer.download.nvidia.com/compute'

readonly -A DEFAULT_NVIDIA_DEBIAN_CUDA_URLS=(
  [10.1]="${NVIDIA_BASE_DL_URL}/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run"
  [10.2]="${NVIDIA_BASE_DL_URL}/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run"
  [11.0]="${NVIDIA_BASE_DL_URL}/cuda/11.0.3/local_installers/cuda_11.0.3_450.51.06_linux.run"
  [11.1]="${NVIDIA_BASE_DL_URL}/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run")
readonly DEFAULT_NVIDIA_DEBIAN_CUDA_URL=${DEFAULT_NVIDIA_DEBIAN_CUDA_URLS["${CUDA_VERSION}"]}
NVIDIA_DEBIAN_CUDA_URL=$(get_metadata_attribute 'cuda-url' "${DEFAULT_NVIDIA_DEBIAN_CUDA_URL}")
readonly NVIDIA_DEBIAN_CUDA_URL

# Parameters for NVIDIA-provided Ubuntu GPU driver
readonly NVIDIA_UBUNTU_REPOSITORY_URL="${NVIDIA_BASE_DL_URL}/cuda/repos/ubuntu1804/x86_64"
readonly NVIDIA_UBUNTU_REPOSITORY_KEY="${NVIDIA_UBUNTU_REPOSITORY_URL}/7fa2af80.pub"
readonly NVIDIA_UBUNTU_REPOSITORY_CUDA_PIN="${NVIDIA_UBUNTU_REPOSITORY_URL}/cuda-ubuntu1804.pin"

# Parameters for NVIDIA-provided NCCL library
readonly DEFAULT_NCCL_REPO_URL="${NVIDIA_BASE_DL_URL}/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb"
NCCL_REPO_URL=$(get_metadata_attribute 'nccl-repo-url' "${DEFAULT_NCCL_REPO_URL}")
readonly NCCL_REPO_URL
NCCL_VERSION=$(get_metadata_attribute 'nccl-version' '2.7.8')
readonly NCCL_VERSION

# Parameters for NVIDIA-provided CUDNN library
CUDNN_VERSION=$(get_metadata_attribute 'cudnn-version' '')
readonly CUDNN_VERSION
readonly CUDNN_TARBALL="cudnn-${CUDA_VERSION}-linux-x64-v${CUDNN_VERSION}.tgz"
readonly CUDNN_TARBALL_URL="http://developer.download.nvidia.com/compute/redist/cudnn/v${CUDNN_VERSION%.*}/${CUDNN_TARBALL}"

# Whether to install NVIDIA-provided or OS-provided GPU driver
GPU_DRIVER_PROVIDER=$(get_metadata_attribute 'gpu-driver-provider' 'NVIDIA')
readonly GPU_DRIVER_PROVIDER

# Stackdriver GPU agent parameters
readonly GPU_AGENT_REPO_URL='https://raw.githubusercontent.com/GoogleCloudPlatform/ml-on-gcp/master/dlvm/gcp-gpu-utilization-metrics'
# Whether to install GPU monitoring agent that sends GPU metrics to Stackdriver
INSTALL_GPU_AGENT=$(get_metadata_attribute 'install-gpu-agent' 'false')
readonly INSTALL_GPU_AGENT

# Dataproc configurations
readonly HADOOP_CONF_DIR='/etc/hadoop/conf'
readonly HIVE_CONF_DIR='/etc/hive/conf'
readonly SPARK_CONF_DIR='/etc/spark/conf'

function execute_with_retries() {
  local -r cmd=$1
  for ((i = 0; i < 10; i++)); do
    if eval "$cmd"; then
      return 0
    fi
    sleep 5
  done
  return 1
}

function install_nvidia_nccl() {
  local tmp_dir
  tmp_dir=$(mktemp -d -t gpu-init-action-nccl-XXXX)

  curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
    "${NCCL_REPO_URL}" -o "${tmp_dir}/nvidia-ml-repo.deb"
  dpkg -i "${tmp_dir}/nvidia-ml-repo.deb"

  execute_with_retries "apt-get update"

  local -r nccl_version="${NCCL_VERSION}-1+cuda${CUDA_VERSION}"
  execute_with_retries \
    "apt-get install -y --allow-unauthenticated libnccl2=${nccl_version} libnccl-dev=${nccl_version}"
}

function install_nvidia_cudnn() {
  if [[ ${OS_NAME} == ubuntu ]]; then
    local major_version
    major_version="${CUDNN_VERSION%%.*}"
    local cudnn_pkg_version
    cudnn_pkg_version="${CUDNN_VERSION}-1+cuda${CUDA_VERSION}"
    local -a packages
    packages=(
      "libcudnn${major_version}=${cudnn_pkg_version}"
      "libcudnn${major_version}-dev=${cudnn_pkg_version}")
    execute_with_retries \
      "apt-get install -y --no-install-recommends ${packages[*]}"
  else
    local tmp_dir
    tmp_dir=$(mktemp -d -t gpu-init-action-cudnn-XXXX)

    curl -fSsL --retry-connrefused --retry 10 --retry-max-time 30 \
      "${CUDNN_TARBALL_URL}" -o "${tmp_dir}/${CUDNN_TARBALL}"

    tar -xzf "${tmp_dir}/${CUDNN_TARBALL}" -C /usr/local

    cat <<'EOF' >>/etc/profile.d/cudnn.sh
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
EOF
  fi

  ldconfig

  echo "NVIDIA cuDNN successfully installed for ${OS_NAME}."
}

# Install NVIDIA GPU driver provided by NVIDIA
function install_nvidia_gpu_driver() {
  curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
    "${NVIDIA_UBUNTU_REPOSITORY_KEY}" | apt-key add -
  if [[ ${OS_NAME} == debian ]]; then
    curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
      "${NVIDIA_DEBIAN_GPU_DRIVER_URL}" -o driver.run
    bash "./driver.run" --silent --install-libglvnd

    curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
      "${NVIDIA_DEBIAN_CUDA_URL}" -o cuda.run
    bash "./cuda.run" --silent --toolkit --no-opengl-libs
  elif [[ ${OS_NAME} == ubuntu ]]; then
    curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
      "${NVIDIA_UBUNTU_REPOSITORY_CUDA_PIN}" -o /etc/apt/preferences.d/cuda-repository-pin-600

    add-apt-repository "deb ${NVIDIA_UBUNTU_REPOSITORY_URL} /"
    execute_with_retries "apt-get update"

    if [[ -n "${CUDA_VERSION}" ]]; then
      local -r cuda_package=cuda-${CUDA_VERSION//./-}
    else
      local -r cuda_package=cuda
    fi
    # Without --no-install-recommends this takes a very long time.
    execute_with_retries "apt-get install -y -q --no-install-recommends ${cuda_package}"
  else
    echo "Unsupported OS: '${OS_NAME}'"
    exit 1
  fi

  echo "NVIDIA GPU driver provided by NVIDIA was installed successfully"
}

# Install NVIDIA GPU driver provided by OS distribution
function install_os_gpu_driver() {
  local packages=(nvidia-cuda-toolkit)
  local modules=(nvidia-drm nvidia-uvm drm)

  # Add non-free Debian packages.
  # See https://www.debian.org/distrib/packages#note
  if [[ ${OS_NAME} == debian ]]; then
    for type in deb deb-src; do
      for distro in ${OS_DIST} ${OS_DIST}-backports; do
        echo "${type} http://deb.debian.org/debian ${distro} contrib non-free" \
          >>/etc/apt/sources.list.d/non-free.list
      done
    done

    packages+=(nvidia-driver nvidia-kernel-common nvidia-smi)
    modules+=(nvidia-current)
    local -r nvblas_cpu_blas_lib=/usr/lib/libblas.so
  elif [[ ${OS_NAME} == ubuntu ]]; then
    local nvidia_driver_version_ubuntu
    nvidia_driver_version_ubuntu=$(apt list 2>/dev/null | grep -E "^nvidia-driver-[0-9]+/" |
      cut -d/ -f1 | sort | tail -n1 | cut -d- -f3)
    # Ubuntu-specific NVIDIA driver packages and modules
    packages+=(
      "nvidia-driver-${nvidia_driver_version_ubuntu}"
      "nvidia-kernel-common-${nvidia_driver_version_ubuntu}")
    modules+=(nvidia)
    local -r nvblas_cpu_blas_lib=/usr/lib/x86_64-linux-gnu/libblas.so
  else
    echo "Unsupported OS: '${OS_NAME}'"
    exit 1
  fi

  # Install proprietary NVIDIA drivers and CUDA
  # See https://wiki.debian.org/NvidiaGraphicsDrivers
  # Without --no-install-recommends this takes a very long time.
  execute_with_retries "apt-get update"
  execute_with_retries \
    "apt-get install -y -q -t ${OS_DIST}-backports --no-install-recommends ${packages[*]}"

  # Create a system wide NVBLAS config
  # See http://docs.nvidia.com/cuda/nvblas/
  local -r nvblas_config_file=/etc/nvidia/nvblas.conf
  # Create config file if it does not exist - this file doesn't exist by default in Ubuntu
  mkdir -p "$(dirname ${nvblas_config_file})"
  cat <<EOF >>${nvblas_config_file}
# Insert here the CPU BLAS fallback library of your choice.
# The standard libblas.so.3 defaults to OpenBLAS, which does not have the
# requisite CBLAS API.
NVBLAS_CPU_BLAS_LIB ${nvblas_cpu_blas_lib}
# Use all GPUs
NVBLAS_GPU_LIST ALL
# Add more configuration here.
EOF
  echo "NVBLAS_CONFIG_FILE=${nvblas_config_file}" >>/etc/environment

  # Rebooting during an initialization action is not recommended, so just
  # dynamically load kernel modules. If you want to run an X server, it is
  # recommended that you schedule a reboot to occur after the initialization
  # action finishes.
  modprobe -r nouveau
  modprobe "${modules[@]}"

  # Restart any NodeManagers, so they pick up the NVBLAS config.
  if systemctl status hadoop-yarn-nodemanager; then
    # Kill Node Manager to prevent unregister/register cycle
    systemctl kill -s KILL hadoop-yarn-nodemanager
  fi

  echo "NVIDIA GPU driver provided by ${OS_NAME} was installed successfully"
}

# Collects 'gpu_utilization' and 'gpu_memory_utilization' metrics
function install_gpu_agent() {
  if ! command -v pip; then
    execute_with_retries "apt-get install -y -q python-pip"
  fi
  local install_dir=/opt/gpu-utilization-agent
  # -p keeps the script from failing if the directory already exists.
  mkdir -p "${install_dir}"
  curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
    "${GPU_AGENT_REPO_URL}/requirements.txt" -o "${install_dir}/requirements.txt"
  curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
    "${GPU_AGENT_REPO_URL}/report_gpu_metrics.py" -o "${install_dir}/report_gpu_metrics.py"
  pip install -r "${install_dir}/requirements.txt"

  # Generate GPU service.
  cat <<EOF >/lib/systemd/system/gpu-utilization-agent.service
[Unit]
Description=GPU Utilization Metric Agent

[Service]
Type=simple
PIDFile=/run/gpu_agent.pid
ExecStart=/bin/bash --login -c 'python "${install_dir}/report_gpu_metrics.py"'
User=root
Group=root
WorkingDirectory=/
Restart=always

[Install]
WantedBy=multi-user.target
EOF
  # Reload systemd manager configuration
  systemctl daemon-reload
  # Enable gpu-utilization-agent service
  systemctl --no-reload --now enable gpu-utilization-agent.service
}

function set_hadoop_property() {
  local -r config_file=$1
  local -r property=$2
  local -r value=$3
  bdconfig set_property \
    --configuration_file "${HADOOP_CONF_DIR}/${config_file}" \
    --name "${property}" --value "${value}" \
    --clobber
}

function configure_yarn() {
  if [[ ! -f ${HADOOP_CONF_DIR}/resource-types.xml ]]; then
    printf '<?xml version="1.0" ?>\n<configuration/>' >"${HADOOP_CONF_DIR}/resource-types.xml"
  fi
  set_hadoop_property 'resource-types.xml' 'yarn.resource-types' 'yarn.io/gpu'

  set_hadoop_property 'capacity-scheduler.xml' \
    'yarn.scheduler.capacity.resource-calculator' \
    'org.apache.hadoop.yarn.util.resource.DominantResourceCalculator'

  set_hadoop_property 'yarn-site.xml' 'yarn.resource-types' 'yarn.io/gpu'
}

# This configuration should be applied only if GPU is attached to the node
function configure_yarn_nodemanager() {
  set_hadoop_property 'yarn-site.xml' 'yarn.nodemanager.resource-plugins' 'yarn.io/gpu'
  set_hadoop_property 'yarn-site.xml' \
    'yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices' 'auto'
  set_hadoop_property 'yarn-site.xml' \
    'yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables' '/usr/bin'
  set_hadoop_property 'yarn-site.xml' \
    'yarn.nodemanager.linux-container-executor.cgroups.mount' 'true'
  set_hadoop_property 'yarn-site.xml' \
    'yarn.nodemanager.linux-container-executor.cgroups.mount-path' '/sys/fs/cgroup'
  set_hadoop_property 'yarn-site.xml' \
    'yarn.nodemanager.linux-container-executor.cgroups.hierarchy' 'yarn'
  set_hadoop_property 'yarn-site.xml' \
    'yarn.nodemanager.container-executor.class' \
    'org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor'
  set_hadoop_property 'yarn-site.xml' 'yarn.nodemanager.linux-container-executor.group' 'yarn'

  # Fix local dirs access permissions
  local yarn_local_dirs=()
  readarray -d ',' yarn_local_dirs < <(bdconfig get_property_value \
    --configuration_file "${HADOOP_CONF_DIR}/yarn-site.xml" \
    --name "yarn.nodemanager.local-dirs" 2>/dev/null | tr -d '\n')
  chown yarn:yarn -R "${yarn_local_dirs[@]/,/}"
}

function configure_gpu_exclusive_mode() {
  # check if running spark 3, if not, enable GPU exclusive mode
  local spark_version
  spark_version=$(spark-submit --version 2>&1 | sed -n 's/.*version[[:blank:]]\+\([0-9]\+\.[0-9]\).*/\1/p' | head -n1)
  if [[ ${spark_version} != 3.* ]]; then
    # include exclusive mode on GPU
    nvidia-smi -c EXCLUSIVE_PROCESS
  fi
}

function configure_gpu_isolation() {
  # Download GPU discovery script
  local -r spark_gpu_script_dir='/usr/lib/spark/scripts/gpu'
  mkdir -p ${spark_gpu_script_dir}
  local -r gpu_resources_url=https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
  curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
    "${gpu_resources_url}" -o ${spark_gpu_script_dir}/getGpusResources.sh
  chmod a+rwx -R ${spark_gpu_script_dir}

  # enable GPU isolation
  sed -i "s/yarn.nodemanager\.linux\-container\-executor\.group\=/yarn\.nodemanager\.linux\-container\-executor\.group\=yarn/g" "${HADOOP_CONF_DIR}/container-executor.cfg"
  printf '\n[gpu]\nmodule.enabled=true\n[cgroups]\nroot=/sys/fs/cgroup\nyarn-hierarchy=yarn\n' >>"${HADOOP_CONF_DIR}/container-executor.cfg"

  chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
  chmod a+rwx -R /sys/fs/cgroup/devices
}

function main() {
  if [[ ${OS_NAME} != debian ]] && [[ ${OS_NAME} != ubuntu ]]; then
    echo "Unsupported OS: '${OS_NAME}'"
    exit 1
  fi

  export DEBIAN_FRONTEND=noninteractive
  execute_with_retries "apt-get update"
  execute_with_retries "apt-get install -y -q pciutils"

  # This configuration should be run on all nodes,
  # regardless of whether they have attached GPUs
  configure_yarn

  # Detect NVIDIA GPU
  if (lspci | grep -q NVIDIA); then
    configure_yarn_nodemanager
    configure_gpu_isolation

    execute_with_retries "apt-get install -y -q 'linux-headers-$(uname -r)'"

    if [[ ${GPU_DRIVER_PROVIDER} == 'NVIDIA' ]]; then
      install_nvidia_gpu_driver
      if [[ -n ${CUDNN_VERSION} ]]; then
        install_nvidia_nccl
        install_nvidia_cudnn
      fi
    elif [[ ${GPU_DRIVER_PROVIDER} == 'OS' ]]; then
      install_os_gpu_driver
    else
      echo "Unsupported GPU driver provider: '${GPU_DRIVER_PROVIDER}'"
      exit 1
    fi

    # Install GPU metrics collection in Stackdriver if needed
    if [[ ${INSTALL_GPU_AGENT} == true ]]; then
      install_gpu_agent
      echo 'GPU metrics agent successfully deployed.'
    else
      echo 'GPU metrics agent will not be installed.'
    fi

    configure_gpu_exclusive_mode
  elif [[ "${ROLE}" == "Master" ]]; then
    configure_yarn_nodemanager
    configure_gpu_isolation
  fi
}

main
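
The script above reads its settings from instance metadata attributes such as `cuda-version`, `gpu-driver-provider`, `cudnn-version`, and `install-gpu-agent`. The following sketch shows one way such attributes might be passed when the cluster is created. The helper name `print_gpu_cluster_command` is hypothetical, and the `INIT_ACTION_URI` bucket path is an assumption about where a copy of the script is staged. The helper prints the command rather than running it.

```shell
# Assumed location of a staged copy of the script; replace with your own.
INIT_ACTION_URI='gs://example-bucket/install_gpu_driver.sh'

# Hypothetical helper: prints (does not run) a cluster-creation command.
# $1 = cluster name, $2 = region,
# $3 = comma-separated metadata attributes read by the script.
print_gpu_cluster_command() {
  echo "gcloud dataproc clusters create $1" \
    "--region=$2" \
    "--worker-accelerator type=nvidia-tesla-t4,count=1" \
    "--initialization-actions ${INIT_ACTION_URI}" \
    "--metadata $3"
}

print_gpu_cluster_command example-cluster us-central1 \
  'gpu-driver-provider=NVIDIA,cuda-version=11.0,install-gpu-agent=true'
```

Attributes that are not passed fall back to the defaults in the script, for example CUDA 10.2 and the NVIDIA-provided driver.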

Verifying the GPU driver install

After you install the GPU driver on your Dataproc nodes, you can verify that the driver is functioning properly. SSH into the master node of your Dataproc cluster and run the following command:

nvidia-smi

If the driver is functioning properly, the output displays the driver version and GPU statistics (see Verifying the GPU driver install).
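
For scripted checks, the same verification can be wrapped in a small function. This is a minimal sketch: the helper name `check_gpu_driver` is hypothetical, and the optional argument exists only so that the probe can be pointed at a different command during testing.

```shell
# Hypothetical helper: reports whether the driver tool is present and,
# if so, prints the GPU name and driver version.
# $1 = command to probe (optional, defaults to nvidia-smi).
check_gpu_driver() {
  local smi=${1:-nvidia-smi}
  if ! command -v "${smi}" >/dev/null 2>&1; then
    echo "${smi} not found: the GPU driver is probably not installed"
    return 1
  fi
  # --query-gpu and --format are standard nvidia-smi options.
  "${smi}" --query-gpu=name,driver_version --format=csv,noheader
}

check_gpu_driver || echo 'Driver check failed on this node.'
```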

Spark configuration

When you submit jobs to Spark, you can use the following Spark configuration to load the needed libraries:

spark.executorEnv.LD_PRELOAD=libnvblas.so
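
One way to apply this setting is through the `--conf` option of spark-shell. The sketch below prints such a command line instead of running it; the helper name `print_spark_shell_command` is hypothetical.

```shell
# Hypothetical helper: prints (does not run) a spark-shell command that
# preloads NVBLAS on the executors.
# $1 = extra spark-shell arguments (optional).
print_spark_shell_command() {
  echo "spark-shell --conf spark.executorEnv.LD_PRELOAD=libnvblas.so${1:+ $1}"
}

print_spark_shell_command
print_spark_shell_command '--master yarn'
```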

Example GPU job

You can test GPUs on Dataproc by running any of the following jobs:

  1. Run one of the Spark ML examples.
  2. Run the following example with spark-shell to perform a matrix computation:
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import java.util.Random

def makeRandomSquareBlockMatrix(rowsPerBlock: Int, nBlocks: Int): BlockMatrix = {
  val range = sc.parallelize(1 to nBlocks)
  val indices = range.cartesian(range)
  return new BlockMatrix(
      indices.map(
          ij => (ij, Matrices.rand(rowsPerBlock, rowsPerBlock, new Random()))),
      rowsPerBlock, rowsPerBlock, 0, 0)
}

val N = 1024 * 5
val n = 2
val mat1 = makeRandomSquareBlockMatrix(N, n)
val mat2 = makeRandomSquareBlockMatrix(N, n)
val mat3 = mat1.multiply(mat2)
mat3.blocks.persist.count
println("Processing complete!")

What's next