Mantieni tutto organizzato con le raccolte
Salva e classifica i contenuti in base alle tue preferenze.
.
Le norme di sicurezza dell'organizzazione, le regole di conformità normativa e altre considerazioni possono richiedere di "ruotare" i cluster Dataproc a intervalli regolari eliminandoli e ricreandoli in base a una pianificazione.
Nell'ambito della rotazione dei cluster, è possibile eseguire il provisioning di nuovi cluster con le versioni più recenti delle immagini Dataproc mantenendo le impostazioni di configurazione dei cluster sostituiti.
Questa pagina mostra come configurare i cluster che prevedi di ruotare ("cluster ruotati"), inviare job e poi ruotare i cluster in base alle esigenze.
Rotazione del cluster immagine personalizzata:
Puoi applicare personalizzazioni precedenti o nuove a un'immagine di base Dataproc precedente o nuova quando ricrei il cluster di immagini personalizzate.
Configura i cluster ruotati
Per configurare i cluster ruotati, crea nomi di cluster univoci con suffisso timestamp
per distinguere i cluster precedenti da quelli nuovi, quindi associa etichette ai cluster
che indicano se un cluster fa parte di un pool di cluster ruotati e riceve attivamente
nuovi invii di job. Questo esempio utilizza le etichette cluster-pool e
cluster-state=active per questi scopi, ma puoi utilizzare
i tuoi nomi di etichette.
cluster-pool-name: il nome del pool di cluster associato a
uno o più cluster. Questo nome viene utilizzato nel nome del cluster e con l'etichetta cluster-pool
associata al cluster per identificare il cluster come parte del pool.
Crea il cluster. Puoi aggiungere argomenti e utilizzare etichette diverse.
I seguenti esempi di Google Cloud CLI e
grafo grafo diretto aciclico (DAG) di Apache Airflow
inviano un job Apache Pig a un cluster. Le etichette del cluster vengono
utilizzate per inviare il job a un cluster attivo all'interno di un pool di cluster.
gcloud
Inviare un job Apache Pig che si trova in Cloud Storage. Scegli il cluster utilizzando le etichette.
Invia un job Apache Pig che si trova in Cloud Storage utilizzando Airflow.
Scegli il cluster utilizzando le etichette.
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from datetime import datetime
# Declare variables
project_id= # e.g: my-project
region="us-central1"
dag_id='pig_wordcount'
cluster_labels={"cluster-pool":${CLUSTER_POOL},
"cluster-state":"active"}
wordcount_script="gs://bucket-name/scripts/wordcount.pig"
# Define DAG
dag = DAG(
dag_id,
schedule_interval=None,
start_date=datetime(2023, 8, 16),
catchup=False
)
PIG_JOB = {
"reference": {"project_id": project_id},
"placement": {"cluster_labels": cluster_labels},
"pig_job": {"query_file_uri": wordcount_script},
}
wordcount_task = DataprocSubmitJobOperator(
task_id='wordcount',
region=region,
project_id=project_id,
job=PIG_JOB,
dag=dag
)
Ruota i cluster
Aggiorna le etichette del cluster associate ai cluster che stai ritirando. Questo
esempio utilizza l'etichetta cluster-state=pendingfordeletion per indicare che
il cluster non riceve nuovi invii di job e viene ritirato,
ma puoi utilizzare la tua etichetta a questo scopo.
Dopo l'aggiornamento dell'etichetta del cluster, il cluster non riceve nuovi job, poiché i job vengono inviati ai cluster all'interno di un pool di cluster solo con etichette active (vedi Invio di job ai cluster).
Elimina i cluster che stai ritirando dopo che hanno terminato l'esecuzione dei job.
[[["Facile da capire","easyToUnderstand","thumb-up"],["Il problema è stato risolto","solvedMyProblem","thumb-up"],["Altra","otherUp","thumb-up"]],[["Difficile da capire","hardToUnderstand","thumb-down"],["Informazioni o codice di esempio errati","incorrectInformationOrSampleCode","thumb-down"],["Mancano le informazioni o gli esempi di cui ho bisogno","missingTheInformationSamplesINeed","thumb-down"],["Problema di traduzione","translationIssue","thumb-down"],["Altra","otherDown","thumb-down"]],["Ultimo aggiornamento 2025-09-04 UTC."],[[["\u003cp\u003eDataproc clusters can be rotated at regular intervals to adhere to security policies and compliance rules, enabling the provisioning of new clusters with updated image versions while retaining configurations.\u003c/p\u003e\n"],["\u003cp\u003eRotated clusters are set up by assigning unique, timestamp-suffixed names and attaching labels like \u003ccode\u003ecluster-pool\u003c/code\u003e and \u003ccode\u003ecluster-state=active\u003c/code\u003e to distinguish and identify them within a pool.\u003c/p\u003e\n"],["\u003cp\u003eJobs can be submitted to active clusters within a cluster pool by using cluster labels to ensure that the job is directed to a cluster that is currently accepting new submissions.\u003c/p\u003e\n"],["\u003cp\u003eClusters are rotated by updating their labels to indicate they are no longer active, for example, by changing \u003ccode\u003ecluster-state=active\u003c/code\u003e to \u003ccode\u003ecluster-state=pendingfordeletion\u003c/code\u003e, which prevents them from receiving new jobs.\u003c/p\u003e\n"],["\u003cp\u003eClusters marked as ready for deletion can be removed after they have completed their current jobs, which can be automated using a monitoring script.\u003c/p\u003e\n"]]],[],null,[".\n\nOrganization security policies, regulatory compliance rules, and other\nconsiderations can prompt you to \"rotate\" your Dataproc clusters\nat regular intervals by deleting, then recreating clusters on a schedule.\nAs part of cluster rotation, new clusters can be provisioned with the latest\nDataproc image versions while retaining the configuration settings\nof the replaced clusters.\n\nThis page shows you how to set up clusters that you plan to rotate (\"rotated\nclusters\"), submit jobs to them, and then rotate the clusters as needed.\n\n[Custom image](/dataproc/docs/guides/dataproc-images) cluster rotation:\nYou can apply previous or new customizations to a previous or new\nDataproc base image when recreating the custom image cluster.\n\nSet up rotated clusters\n\nTo set up rotated clusters, create unique, timestamp-suffixed cluster names\nto distinguish previous from new clusters, and then attach labels to clusters\nthat indicate if a cluster is part of a rotated cluster pool and actively\nreceiving new job submissions. This example uses `cluster-pool` and\n`cluster-state=active` labels for these purposes, but you can use\nyour own label names.\n\n1. Set environment variables:\n\n ```\n PROJECT=project ID \\\n REGION=/compute/docs/regions-zones#available \\\n CLUSTER_POOL=cluster-pool-name \\\n CLUSTER_NAME=$CLUSTER_POOL-$(date '+%Y%m%d%H%M') \\\n BUCKET=Cloud Storage bucket-name\n ```\n\n \u003cbr /\u003e\n\n Notes:\n - \u003cvar translate=\"no\"\u003ecluster-pool-name\u003c/var\u003e: The name of the cluster pool associated with one or more clusters. This name is used in the cluster name and with the `cluster-pool` label attached to the cluster to identify the cluster as part of the pool.\n2. Create the cluster. You can add arguments and use different labels.\n\n ```\n gcloud dataproc clusters create ${CLUSTER_NAME} \\\n --project=${PROJECT_ID} \\\n --region=${REGION} \\\n --bucket=${BUCKET} \\\n --labels=\"cluster-pool=${CLUSTER_POOL},cluster-state=active\"\n ```\n\nSubmit jobs to clusters\n\nThe following Google Cloud CLI and\n[Apache Airflow directed acyclic graph (DAG)](/composer/docs/how-to/using/writing-dags)\nexamples submit an Apache Pig job to a cluster. Cluster labels are\nused to submit the job to an active cluster within a cluster pool. \n\ngcloud\n\nSubmit an Apache Pig job located in Cloud Storage. Pick the cluster using labels.\n\n\u003cbr /\u003e\n\n```\ngcloud dataproc jobs submit pig \\\n --region=${REGION} \\\n --file=gs://${BUCKET}/scripts/script.pig \\\n --cluster-labels=\"cluster-pool=${CLUSTER_POOL},cluster-state=active\"\n \n```\n\n\u003cbr /\u003e\n\nAirflow\n\nSubmit an Apache Pig job located in Cloud Storage using Airflow.\nPick the cluster using labels. \n\n```\nfrom airflow import DAG\nfrom airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator\nfrom datetime import datetime\n\n# Declare variables\nproject_id= # e.g: my-project\nregion=\"us-central1\"\ndag_id='pig_wordcount'\ncluster_labels={\"cluster-pool\":${CLUSTER_POOL},\n \"cluster-state\":\"active\"}\nwordcount_script=\"gs://bucket-name/scripts/wordcount.pig\"\n\n# Define DAG\n\ndag = DAG(\n dag_id,\n schedule_interval=None,\n start_date=datetime(2023, 8, 16),\n catchup=False\n)\n\nPIG_JOB = {\n \"reference\": {\"project_id\": project_id},\n \"placement\": {\"cluster_labels\": cluster_labels},\n \"pig_job\": {\"query_file_uri\": wordcount_script},\n}\n\nwordcount_task = DataprocSubmitJobOperator(\n task_id='wordcount',\n region=region,\n project_id=project_id,\n job=PIG_JOB,\n dag=dag\n)\n```\n\n\u003cbr /\u003e\n\nRotate clusters\n\n1. Update the cluster labels attached to the clusters you are rotating out. This\n examples uses the `cluster-state=pendingfordeletion` label to signify that\n the cluster is not receiving new job submissions and is being rotated out,\n but you can use your own label for this purpose.\n\n ```\n gcloud dataproc clusters update ${CLUSTER_NAME} \\\n --region=${REGION} \\\n --update-labels=\"cluster-state=pendingfordeletion\"\n ```\n\n \u003cbr /\u003e\n\n After the cluster label is updated, the cluster does not receive new jobs\n since jobs are submitted to clusters within a cluster pool\n with `active` labels only (see\n [Submit jobs to clusters](#submit_jobs_to_clusters)).\n2. Delete clusters you are rotating out after they finish running jobs.\n\n | **Note:** You can automate this step with a monitoring script that fetches clusters with the `cluster-state=pendingfordeletion` label (or other label you added with the previous command), checks that no jobs are running on the cluster, and then deletes the cluster."]]