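Build your own pipeline components

Write a component to show a Google Cloud console link

When you run a component, you often want to see not only the link to the component job being launched, but also links to the underlying cloud resources, such as Vertex batch prediction jobs or Dataflow jobs.

The gcp_resource proto is a special parameter that you can use in your component so that the Google Cloud console provides a customized view of the resource's logs and status in the Vertex AI Pipelines console.

Output the `gcp_resource` parameter

Using a container-based component

First, define the gcp_resource parameter in your component, as shown in the following example component.py file:

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.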

# Copyright 2023 The Kubeflow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List

from google_cloud_pipeline_components import _image
from google_cloud_pipeline_components import _placeholders
from kfp.dsl import container_component
from kfp.dsl import ContainerSpec
from kfp.dsl import OutputPath


@container_component
def dataflow_python(
    python_module_path: str,
    temp_location: str,
    gcp_resources: OutputPath(str),
    location: str = 'us-central1',
    requirements_file_path: str = '',
    args: List[str] = [],
    project: str = _placeholders.PROJECT_ID_PLACEHOLDER,
):
  # fmt: off
  """Launch a self-executing Beam Python file on Google Cloud using the
  Dataflow Runner.

  Args:
      location: Location of the Dataflow job. If not set, defaults to `'us-central1'`.
      python_module_path: The GCS path to the Python file to run.
      temp_location: A GCS path for Dataflow to stage temporary job files created during the execution of the pipeline.
      requirements_file_path: The GCS path to the pip requirements file.
      args: The list of args to pass to the Python file. Can include additional parameters for the Dataflow Runner.
      project: Project to create the Dataflow job. Defaults to the project in which the PipelineJob is run.

  Returns:
      gcp_resources: Serialized gcp_resources proto tracking the Dataflow job. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
  """
  # fmt: on
  return ContainerSpec(
      image=_image.GCPC_IMAGE_TAG,
      command=[
          'python3',
          '-u',
          '-m',
          'google_cloud_pipeline_components.container.v1.dataflow.dataflow_launcher',
      ],
      args=[
          '--project',
          project,
          '--location',
          location,
          '--python_module_path',
          python_module_path,
          '--temp_location',
          temp_location,
          '--requirements_file_path',
          requirements_file_path,
          '--args',
          args,
          '--gcp_resources',
          gcp_resources,
      ],
  )

Next, install the Google Cloud Pipeline Components package inside the container:

pip install --upgrade google-cloud-pipeline-components

Then, in your Python code, define the resource as a gcp_resource parameter:

Python

from google.protobuf.json_format import MessageToJson
from google_cloud_pipeline_components.proto.gcp_resources_pb2 import GcpResources

# Describe the underlying Dataflow job so that the console can link to it.
dataflow_resources = GcpResources()
dr = dataflow_resources.resources.add()
dr.resource_type = 'DataflowJob'
dr.resource_uri = 'https://dataflow.googleapis.com/v1b3/projects/[your-project]/locations/us-east1/jobs/[dataflow-job-id]'

# `gcp_resources` is the output file path that the container receives through
# the --gcp_resources argument defined in component.py.
with open(gcp_resources, 'w') as f:
    f.write(MessageToJson(dataflow_resources))
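To make the shape of the serialized output concrete, here is a minimal sketch that builds an equivalent JSON document using only the standard library. It assumes that `MessageToJson` emits field names in lowerCamelCase (`resourceType`, `resourceUri`), which is the protobuf JSON default. The helper name `build_gcp_resources_json` and the sample project and job ID are hypothetical; a real component should keep using the `GcpResources` proto.

```python
import json


def build_gcp_resources_json(resource_type: str, resource_uri: str) -> str:
    """Returns JSON shaped like a serialized GcpResources message with one resource."""
    payload = {
        'resources': [
            # Assumption: protobuf JSON output uses lowerCamelCase field names.
            {'resourceType': resource_type, 'resourceUri': resource_uri},
        ]
    }
    return json.dumps(payload, indent=2)


# Hypothetical project and job ID, for illustration only.
example = build_gcp_resources_json(
    'DataflowJob',
    'https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-east1/jobs/my-job-id',
)
print(example)
```

Printing the result is a quick way to check what a component would write to the `gcp_resources` output path before wiring it into a pipeline.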

Using a Python component

Alternatively, you can return the gcp_resources output parameter as you would any string output parameter:

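from typing import NamedTuple

from kfp import dsl


@dsl.component(
    base_image='python:3.9',
    packages_to_install=['google-cloud-pipeline-components==2.16.1'],
)
def launch_dataflow_component(
    project: str, location: str
) -> NamedTuple('Outputs', [('gcp_resources', str)]):
  # Imports must live inside the function body of a lightweight Python component.
  from google.protobuf.json_format import MessageToJson
  from google_cloud_pipeline_components.proto.gcp_resources_pb2 import GcpResources

  # Launch the Dataflow job here, then record its ID.
  dataflow_job_id = '[dataflow-id]'  # Placeholder: replace with the launched job's ID.
  dataflow_resources = GcpResources()
  dr = dataflow_resources.resources.add()
  dr.resource_type = 'DataflowJob'
  dr.resource_uri = f'https://dataflow.googleapis.com/v1b3/projects/{project}/locations/{location}/jobs/{dataflow_job_id}'
  gcp_resources = MessageToJson(dataflow_resources)
  # Return a one-field tuple that matches the declared NamedTuple output.
  return (gcp_resources,)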

Supported `resource_type` values

You can set resource_type to an arbitrary string, but only the following types have links in the Google Cloud console:

BatchPredictionJob
BigQueryJob
CustomJob
DataflowJob
HyperparameterTuningJob
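Because any string is accepted, a typo in `resource_type` would silently produce a resource with no console link. The following is a minimal sketch of a guard that a component author might add. The names `LINKABLE_RESOURCE_TYPES` and `has_console_link` are hypothetical, and the check assumes that matching is case-sensitive, which the list above does not state explicitly.

```python
# Resource types listed in this section as having Google Cloud console links.
LINKABLE_RESOURCE_TYPES = frozenset({
    'BatchPredictionJob',
    'BigQueryJob',
    'CustomJob',
    'DataflowJob',
    'HyperparameterTuningJob',
})


def has_console_link(resource_type: str) -> bool:
    """Returns True if the given resource_type is one of the linkable types.

    Assumption: the comparison is exact and case-sensitive.
    """
    return resource_type in LINKABLE_RESOURCE_TYPES


for candidate in ('DataflowJob', 'MyCustomType'):
    print(candidate, has_console_link(candidate))
```

A component could log a warning when `has_console_link` returns False, rather than failing, since other strings are still valid values.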

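Write a component to cancel the underlying resources

When a pipeline job is canceled, the underlying Google Cloud resources keep running by default. They are not canceled automatically. To change this behavior, attach a SIGTERM handler to your pipeline job. A good place to do this is just before the polling loop for a job that could run for a long time.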
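The pattern described here, registering a SIGTERM handler just before the polling loop, can be sketched with Python's standard `signal` module. This is an illustration under stated assumptions, not the implementation used by Google Cloud Pipeline Components: `run_with_cancellation`, `get_state`, `cancel_job`, and the `TERMINAL_STATES` values are all hypothetical, and a real component would call the cancel API of the resource it launched.

```python
import signal
import time
from typing import Callable

# Hypothetical terminal states; real resources define their own state names.
TERMINAL_STATES = frozenset({'SUCCEEDED', 'FAILED', 'CANCELLED'})


def run_with_cancellation(
    get_state: Callable[[], str],
    cancel_job: Callable[[], None],
    poll_interval_seconds: float = 30.0,
) -> str:
    """Polls a job until it reaches a terminal state, cancelling it on SIGTERM."""
    cancel_requested = False

    def _on_sigterm(signum, frame):
        nonlocal cancel_requested
        if not cancel_requested:  # Forward the cancellation only once.
            cancel_requested = True
            cancel_job()

    # Attach the handler just before the polling loop starts.
    # Note: signal.signal must be called from the main thread.
    previous_handler = signal.signal(signal.SIGTERM, _on_sigterm)
    try:
        while True:
            state = get_state()
            if state in TERMINAL_STATES:
                return state
            time.sleep(poll_interval_seconds)
    finally:
        # Restore the earlier handler if it was installed from Python.
        if previous_handler is not None:
            signal.signal(signal.SIGTERM, previous_handler)
```

The handler only forwards the cancellation request. The loop keeps polling afterwards so that the component exits once the underlying resource actually reports a terminal state.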

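Cancellation is already implemented in several Google Cloud Pipeline Components, including:

Batch prediction jobs
BigQuery ML jobs
Custom jobs
Dataproc Serverless batch jobs
Hyperparameter tuning jobs

For more information, including sample code that shows how to attach a SIGTERM handler, see the following GitHub links:

Consider the following when you implement your SIGTERM handler:

Cancellation propagation works only after the component has been running for a few minutes. This is typically because background startup tasks must be processed before the Python signal handlers are called.
Some Google Cloud resources might not implement cancellation. For example, creating or deleting a Vertex AI endpoint or model can create a long-running operation that accepts a cancellation request through its REST API but doesn't implement the cancellation itself.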