Index
- AutoscalingPolicyService (interface)
- ClusterController (interface)
- JobController (interface)
- NodeGroupController (interface)
- WorkflowTemplateService (interface)
- AcceleratorConfig (message)
- AutoscalingConfig (message)
- AutoscalingPolicy (message)
- AuxiliaryNodeGroup (message)
- AuxiliaryServicesConfig (message)
- BasicAutoscalingAlgorithm (message)
- BasicYarnAutoscalingConfig (message)
- CancelJobRequest (message)
- Cluster (message)
- ClusterConfig (message)
- ClusterMetrics (message)
- ClusterOperation (message)
- ClusterOperationMetadata (message)
- ClusterOperationStatus (message)
- ClusterOperationStatus.State (enum)
- ClusterSelector (message)
- ClusterStatus (message)
- ClusterStatus.State (enum)
- ClusterStatus.Substate (enum)
- Component (enum)
- ConfidentialInstanceConfig (message)
- CreateAutoscalingPolicyRequest (message)
- CreateClusterRequest (message)
- CreateWorkflowTemplateRequest (message)
- DataprocMetricConfig (message)
- DataprocMetricConfig.Metric (message)
- DataprocMetricConfig.MetricSource (enum)
- DeleteAutoscalingPolicyRequest (message)
- DeleteClusterRequest (message)
- DeleteJobRequest (message)
- DeleteWorkflowTemplateRequest (message)
- DiagnoseClusterRequest (message)
- DiagnoseClusterResults (message)
- DiskConfig (message)
- DriverSchedulingConfig (message)
- EncryptionConfig (message)
- EndpointConfig (message)
- FailureAction (enum)
- GceClusterConfig (message)
- GceClusterConfig.PrivateIpv6GoogleAccess (enum)
- GetAutoscalingPolicyRequest (message)
- GetClusterRequest (message)
- GetJobRequest (message)
- GetNodeGroupRequest (message)
- GetWorkflowTemplateRequest (message)
- GkeClusterConfig (message)
- GkeClusterConfig.NamespacedGkeDeploymentTarget (message) (deprecated)
- GkeNodePoolConfig (message)
- GkeNodePoolConfig.GkeNodeConfig (message)
- GkeNodePoolConfig.GkeNodePoolAcceleratorConfig (message)
- GkeNodePoolConfig.GkeNodePoolAutoscalingConfig (message)
- GkeNodePoolTarget (message)
- GkeNodePoolTarget.Role (enum)
- HadoopJob (message)
- HiveJob (message)
- IdentityConfig (message)
- InstanceGroupAutoscalingPolicyConfig (message)
- InstanceGroupConfig (message)
- InstanceGroupConfig.Preemptibility (enum)
- InstantiateInlineWorkflowTemplateRequest (message)
- InstantiateWorkflowTemplateRequest (message)
- Job (message)
- JobMetadata (message)
- JobPlacement (message)
- JobReference (message)
- JobScheduling (message)
- JobStatus (message)
- JobStatus.State (enum)
- JobStatus.Substate (enum)
- KerberosConfig (message)
- KubernetesClusterConfig (message)
- KubernetesSoftwareConfig (message)
- LifecycleConfig (message)
- ListAutoscalingPoliciesRequest (message)
- ListAutoscalingPoliciesResponse (message)
- ListClustersRequest (message)
- ListClustersResponse (message)
- ListJobsRequest (message)
- ListJobsRequest.JobStateMatcher (enum)
- ListJobsResponse (message)
- ListWorkflowTemplatesRequest (message)
- ListWorkflowTemplatesResponse (message)
- LoggingConfig (message)
- LoggingConfig.Level (enum)
- ManagedCluster (message)
- ManagedGroupConfig (message)
- MetastoreConfig (message)
- NodeGroup (message)
- NodeGroup.Role (enum)
- NodeGroupAffinity (message)
- NodeGroupOperationMetadata (message)
- NodeGroupOperationMetadata.NodeGroupOperationType (enum)
- NodeInitializationAction (message)
- OrderedJob (message)
- ParameterValidation (message)
- PigJob (message)
- PrestoJob (message)
- PySparkJob (message)
- QueryList (message)
- RegexValidation (message)
- ReservationAffinity (message)
- ReservationAffinity.Type (enum)
- ResizeNodeGroupRequest (message)
- SecurityConfig (message)
- ShieldedInstanceConfig (message)
- SoftwareConfig (message)
- SparkHistoryServerConfig (message)
- SparkJob (message)
- SparkRJob (message)
- SparkSqlJob (message)
- StartClusterRequest (message)
- StopClusterRequest (message)
- SubmitJobRequest (message)
- TemplateParameter (message)
- UpdateAutoscalingPolicyRequest (message)
- UpdateClusterRequest (message)
- UpdateJobRequest (message)
- UpdateWorkflowTemplateRequest (message)
- ValueValidation (message)
- VirtualClusterConfig (message)
- WorkflowGraph (message)
- WorkflowMetadata (message)
- WorkflowMetadata.State (enum)
- WorkflowNode (message)
- WorkflowNode.NodeState (enum)
- WorkflowTemplate (message)
- WorkflowTemplatePlacement (message)
- YarnApplication (message)
- YarnApplication.State (enum)
AutoscalingPolicyService
The API interface for managing autoscaling policies in the Dataproc API.
| Method | Description |
| --- | --- |
| CreateAutoscalingPolicy | Creates new autoscaling policy. |
| DeleteAutoscalingPolicy | Deletes an autoscaling policy. It is an error to delete an autoscaling policy that is in use by one or more clusters. |
| GetAutoscalingPolicy | Retrieves autoscaling policy. |
| ListAutoscalingPolicies | Lists autoscaling policies in the project. |
| UpdateAutoscalingPolicy | Updates (replaces) autoscaling policy. Disabled check for update_mask, because all updates will be full replacements. |
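A minimal sketch of calling this interface from the google-cloud-dataproc Python client library (an assumption; any gRPC or REST client works). The project ID and region are placeholders.

```python
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

# Point the client at the regional endpoint that matches the policies' region.
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# ListAutoscalingPolicies: list autoscaling policies in the project/region.
parent = f"projects/{project_id}/locations/{region}"
for policy in client.list_autoscaling_policies(parent=parent):
    print(policy.id, policy.name)
```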
ClusterController
The ClusterControllerService provides methods to manage clusters of Compute Engine instances.
| Method | Description |
| --- | --- |
| CreateCluster | Creates a cluster in a project. The returned Operation.metadata will be ClusterOperationMetadata. |
| DeleteCluster | Deletes a cluster in a project. The returned Operation.metadata will be ClusterOperationMetadata. |
| DiagnoseCluster | Gets cluster diagnostic information. The returned Operation.metadata will be ClusterOperationMetadata. |
| GetCluster | Gets the resource representation for a cluster in a project. |
| ListClusters | Lists all regions/{region}/clusters in a project alphabetically. |
| StartCluster | Starts a cluster in a project. |
| StopCluster | Stops a cluster in a project. |
| UpdateCluster | Updates a cluster in a project. The returned Operation.metadata will be ClusterOperationMetadata. |
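A hedged sketch of CreateCluster with the google-cloud-dataproc Python client (an assumption; the REST and gRPC surfaces are equivalent). The cluster name, machine types, and sizes are illustrative placeholders.

```python
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"   # placeholders

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",   # placeholder
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# CreateCluster returns a long-running Operation; result() blocks until it completes.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()
print(f"Created {result.cluster_name} ({result.cluster_uuid})")
```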
JobController
The JobController provides methods to manage jobs.
| Method | Description |
| --- | --- |
| CancelJob | Starts a job cancellation request. To access the job resource after cancellation, call regions/{region}/jobs.list or regions/{region}/jobs.get. |
| DeleteJob | Deletes the job from the project. If the job is active, the delete fails and the response returns FAILED_PRECONDITION. |
| GetJob | Gets the resource representation for a job in a project. |
| ListJobs | Lists regions/{region}/jobs in a project. |
| SubmitJob | Submits a job to a cluster. |
| SubmitJobAsOperation | Submits a job to a cluster. |
| UpdateJob | Updates a job in a project. |
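A hedged sketch of SubmitJob for a PySpark job, again using the Python client as an assumption. The cluster name and the gs:// script path are placeholders.

```python
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"   # placeholders

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "example-cluster"},                     # placeholder
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/word_count.py"},
}

# SubmitJob returns the Job resource immediately; poll GetJob (or use
# SubmitJobAsOperation) to track completion.
submitted = client.submit_job(
    request={"project_id": project_id, "region": region, "job": job}
)
print(submitted.reference.job_id, submitted.status.state.name)
```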
NodeGroupController
The NodeGroupControllerService provides methods to manage node groups of Compute Engine managed instances.

| Method | Description |
| --- | --- |
| GetNodeGroup | Gets the resource representation for a node group in a cluster. |
| ResizeNodeGroup | Resizes a node group in a cluster. The returned Operation.metadata is NodeGroupOperationMetadata. |
WorkflowTemplateService
The API interface for managing Workflow Templates in the Dataproc API.
| Method | Description |
| --- | --- |
| CreateWorkflowTemplate | Creates new workflow template. |
| DeleteWorkflowTemplate | Deletes a workflow template. It does not cancel in-progress workflows. |
| GetWorkflowTemplate | Retrieves the latest workflow template. Can retrieve previously instantiated template by specifying optional version parameter. |
| InstantiateInlineWorkflowTemplate | Instantiates a template and begins execution. This method is equivalent to executing the sequence CreateWorkflowTemplate, InstantiateWorkflowTemplate, DeleteWorkflowTemplate. The returned Operation can be used to track execution of the workflow by polling operations.get; the running workflow can be aborted via operations.cancel. The Operation.metadata will be WorkflowMetadata. On successful completion, Operation.response will be Empty. |
| InstantiateWorkflowTemplate | Instantiates a template and begins execution. The returned Operation can be used to track execution of the workflow by polling operations.get; the running workflow can be aborted via operations.cancel. The Operation.metadata will be WorkflowMetadata. On successful completion, Operation.response will be Empty. |
| ListWorkflowTemplates | Lists workflows that match the specified filter in the request. |
| UpdateWorkflowTemplate | Updates (replaces) workflow template. The updated template must contain version that matches the current server version. |
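A hedged sketch of InstantiateWorkflowTemplate with the Python client (an assumption). The template resource name and the parameter key are placeholders; the parameters map only applies if the template declares matching parameters.

```python
from google.cloud import dataproc_v1

region = "us-central1"   # placeholder
client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

name = "projects/my-project/regions/us-central1/workflowTemplates/my-template"  # placeholder
operation = client.instantiate_workflow_template(
    request={"name": name, "parameters": {"CLUSTER_NAME": "wf-cluster"}}         # placeholders
)
operation.result()   # blocks until the entire workflow has finished
print("Workflow completed")
```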
AcceleratorConfig
Specifies the type and number of accelerator cards attached to the instances of an instance group. See GPUs on Compute Engine.

| Field | Description |
| --- | --- |
| accelerator_type_uri | Full URL, partial URI, or short name of the accelerator type resource to expose to this instance. See Compute Engine AcceleratorTypes. Auto Zone Exception: If you are using the Dataproc Auto Zone Placement feature, you must use the short name of the accelerator type resource. |
| accelerator_count | The number of the accelerator cards of this type exposed to this instance. |
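An illustrative payload only: an InstanceGroupConfig fragment attaching one accelerator per worker. The accelerator type is a placeholder; per the field description above, use the short-name form when Auto Zone Placement is in effect.

```python
# Illustrative only; the accelerator type and machine type are placeholders.
worker_config = {
    "num_instances": 2,
    "machine_type_uri": "n1-standard-8",
    "accelerators": [
        {
            "accelerator_type_uri": "nvidia-tesla-t4",  # placeholder short name
            "accelerator_count": 1,
        }
    ],
}
```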
AutoscalingConfig
Autoscaling Policy config associated with the cluster.
| Field | Description |
| --- | --- |
| policy_uri | Optional. The autoscaling policy used by the cluster. Only resource names that include the project ID and location (region) are valid. Note that the policy must be in the same project and Dataproc region. |
AutoscalingPolicy
Describes an autoscaling policy for Dataproc cluster autoscaler.
| Field | Description |
| --- | --- |
| id | Required. The policy id. The id must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), and hyphens (-). Cannot begin or end with underscore or hyphen. Must consist of between 3 and 50 characters. |
| name | Output only. The "resource name" of the autoscaling policy, as described in https://cloud.google.com/apis/design/resource_names. |
| worker_config | Required. Describes how the autoscaler will operate for primary workers. |
| secondary_worker_config | Optional. Describes how the autoscaler will operate for secondary workers. |
| Union field algorithm. Autoscaling algorithm for policy. algorithm can be only one of the following: | |
| basic_algorithm | Basic algorithm for autoscaling. |
AuxiliaryNodeGroup
Node group identification and configuration information.
| Field | Description |
| --- | --- |
| node_group | Required. Node group configuration. |
| node_group_id | Optional. A node group ID. Generated if not specified. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), and hyphens (-). Cannot begin or end with underscore or hyphen. Must consist of from 3 to 33 characters. |
AuxiliaryServicesConfig
Auxiliary services configuration for a Cluster.
| Field | Description |
| --- | --- |
| metastore_config | Optional. The Hive Metastore configuration for this workload. |
| spark_history_server_config | Optional. The Spark History Server configuration for the workload. |
BasicAutoscalingAlgorithm
Basic algorithm for autoscaling.
| Field | Description |
| --- | --- |
| cooldown_period | Optional. Duration between scaling events. A scaling period starts after the update operation from the previous event has completed. Bounds: [2m, 1d]. Default: 2m. |
| Union field config. config can be only one of the following: | |
| yarn_config | Optional. YARN autoscaling configuration. |
BasicYarnAutoscalingConfig
Basic autoscaling configurations for YARN.
| Field | Description |
| --- | --- |
| graceful_decommission_timeout | Required. Timeout for YARN graceful decommissioning of Node Managers. Specifies the duration to wait for jobs to complete before forcefully removing workers (and potentially interrupting jobs). Only applicable to downscaling operations. Bounds: [0s, 1d]. |
| scale_up_factor | Required. Fraction of average YARN pending memory in the last cooldown period for which to add workers. A scale-up factor of 1.0 will result in scaling up so that there is no pending memory remaining after the update (more aggressive scaling). A scale-up factor closer to 0 will result in a smaller magnitude of scaling up (less aggressive scaling). See How autoscaling works for more information. Bounds: [0.0, 1.0]. |
| scale_down_factor | Required. Fraction of average YARN pending memory in the last cooldown period for which to remove workers. A scale-down factor of 1 will result in scaling down so that there is no available memory remaining after the update (more aggressive scaling). A scale-down factor of 0 disables removing workers, which can be beneficial for autoscaling a single job. See How autoscaling works for more information. Bounds: [0.0, 1.0]. |
| scale_up_min_worker_fraction | Optional. Minimum scale-up threshold as a fraction of total cluster size before scaling occurs. For example, in a 20-worker cluster, a threshold of 0.1 means the autoscaler must recommend at least a 2-worker scale-up for the cluster to scale. A threshold of 0 means the autoscaler will scale up on any recommended change. Bounds: [0.0, 1.0]. Default: 0.0. |
| scale_down_min_worker_fraction | Optional. Minimum scale-down threshold as a fraction of total cluster size before scaling occurs. For example, in a 20-worker cluster, a threshold of 0.1 means the autoscaler must recommend at least a 2-worker scale-down for the cluster to scale. A threshold of 0 means the autoscaler will scale down on any recommended change. Bounds: [0.0, 1.0]. Default: 0.0. |
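A hedged sketch tying the fields above together: creating an autoscaling policy whose basic_algorithm uses a YARN config. The Python client is an assumption, and the policy ID, bounds, and factors are illustrative placeholders.

```python
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"   # placeholders
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = {
    "id": "example-policy",                              # placeholder
    "worker_config": {"min_instances": 2, "max_instances": 20},
    "secondary_worker_config": {"min_instances": 0, "max_instances": 50},
    "basic_algorithm": {
        "cooldown_period": {"seconds": 240},             # 4 minutes, within [2m, 1d]
        "yarn_config": {
            "graceful_decommission_timeout": {"seconds": 3600},
            "scale_up_factor": 0.5,
            "scale_down_factor": 1.0,
            "scale_up_min_worker_fraction": 0.0,
            "scale_down_min_worker_fraction": 0.0,
        },
    },
}

created = client.create_autoscaling_policy(
    request={"parent": f"projects/{project_id}/locations/{region}", "policy": policy}
)
print(created.name)
```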
CancelJobRequest
A request to cancel a job.
| Field | Description |
| --- | --- |
| project_id | Required. The ID of the Google Cloud Platform project that the job belongs to. |
| region | Required. The Dataproc region in which to handle the request. |
| job_id | Required. The job ID. Authorization requires an IAM permission on the specified resource. |
Cluster
Describes the identifying information, config, and status of a Dataproc cluster.

| Field | Description |
| --- | --- |
| project_id | Required. The Google Cloud Platform project ID that the cluster belongs to. |
| cluster_name | Required. The cluster name, which must be unique within a project. The name must start with a lowercase letter, and can contain up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen. The name of a deleted cluster can be reused. |
| config | Optional. The cluster config for a cluster of Compute Engine instances. Note that Dataproc may set default values, and values may change when clusters are updated. Exactly one of config or virtual_cluster_config must be specified. |
| virtual_cluster_config | Optional. The virtual cluster config is used when creating a Dataproc cluster that does not directly control the underlying compute resources, for example, when creating a Dataproc-on-GKE cluster. Dataproc may set default values, and values may change when clusters are updated. Exactly one of config or virtual_cluster_config must be specified. |
| labels | Optional. The labels to associate with this cluster. Label keys must contain 1 to 63 characters, and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters, and must conform to RFC 1035. No more than 32 labels can be associated with a cluster. |
| status | Output only. Cluster status. |
| status_history[] | Output only. The previous cluster status. |
| cluster_uuid | Output only. A cluster UUID (Unique Universal Identifier). Dataproc generates this value when it creates the cluster. |
| metrics | Output only. Contains cluster daemon metrics such as HDFS and YARN stats. Beta Feature: This report is available for testing purposes only. It may be changed before final release. |
ClusterConfig
The cluster config.
| Field | Description |
| --- | --- |
| config_bucket | Optional. A Cloud Storage bucket used to stage job dependencies, config files, and job driver console output. If you do not specify a staging bucket, Cloud Dataproc will determine a Cloud Storage location (US, ASIA, or EU) for your cluster's staging bucket according to the Compute Engine zone where your cluster is deployed, and then create and manage this project-level, per-location bucket (see Dataproc staging and temp buckets). This field requires a Cloud Storage bucket name, not a URI to a Cloud Storage bucket. |
| temp_bucket | Optional. A Cloud Storage bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files. If you do not specify a temp bucket, Dataproc will determine a Cloud Storage location (US, ASIA, or EU) for your cluster's temp bucket according to the Compute Engine zone where your cluster is deployed, and then create and manage this project-level, per-location bucket. The default bucket has a TTL of 90 days, but you can use any TTL (or none) if you specify a bucket (see Dataproc staging and temp buckets). This field requires a Cloud Storage bucket name, not a URI to a Cloud Storage bucket. |
| gce_cluster_config | Optional. The shared Compute Engine config settings for all instances in a cluster. |
| master_config | Optional. The Compute Engine config settings for the cluster's master instance. |
| worker_config | Optional. The Compute Engine config settings for the cluster's worker instances. |
| secondary_worker_config | Optional. The Compute Engine config settings for a cluster's secondary worker instances. |
| software_config | Optional. The config settings for cluster software. |
| initialization_actions[] | Optional. Commands to execute on each node after config is completed. By default, executables are run on master and all worker nodes. You can test a node's role metadata to run an executable on only a master or worker node. |
| encryption_config | Optional. Encryption settings for the cluster. |
| autoscaling_config | Optional. Autoscaling config for the policy associated with the cluster. Cluster does not autoscale if this field is unset. |
| security_config | Optional. Security settings for the cluster. |
| lifecycle_config | Optional. Lifecycle setting for the cluster. |
| endpoint_config | Optional. Port/endpoint configuration for this cluster. |
| metastore_config | Optional. Metastore configuration. |
| dataproc_metric_config | Optional. The config for Dataproc metrics. |
| auxiliary_node_groups[] | Optional. The node group settings. |
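An illustrative ClusterConfig payload assembled from the fields above. All bucket names, the image version, the Spark property, and the autoscaling policy URI are placeholders, not recommended values.

```python
# Illustrative only; every name and value below is a placeholder.
cluster_config = {
    "config_bucket": "my-dataproc-staging-bucket",        # bucket name, not a URI
    "gce_cluster_config": {"zone_uri": "us-central1-b"},
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    "software_config": {
        "image_version": "2.1-debian11",
        "properties": {"spark:spark.executor.memory": "4g"},
    },
    "autoscaling_config": {
        # The policy must be in the same project and Dataproc region as the cluster.
        "policy_uri": "projects/my-project/locations/us-central1/autoscalingPolicies/example-policy",
    },
    "endpoint_config": {"enable_http_port_access": True},
}
```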
ClusterMetrics
Contains cluster daemon metrics, such as HDFS and YARN stats.
Beta Feature: This report is available for testing purposes only. It may be changed before final release.
| Field | Description |
| --- | --- |
| hdfs_metrics | The HDFS metrics. |
| yarn_metrics | YARN metrics. |
ClusterOperation
The cluster operation triggered by a workflow.
| Field | Description |
| --- | --- |
| operation_id | Output only. The id of the cluster operation. |
| error | Output only. Error, if operation failed. |
| done | Output only. Indicates the operation is done. |
ClusterOperationMetadata
Metadata describing the operation.
| Field | Description |
| --- | --- |
| cluster_name | Output only. Name of the cluster for the operation. |
| cluster_uuid | Output only. Cluster UUID for the operation. |
| status | Output only. Current operation status. |
| status_history[] | Output only. The previous operation status. |
| operation_type | Output only. The operation type. |
| description | Output only. Short description of operation. |
| labels | Output only. Labels associated with the operation. |
| warnings[] | Output only. Errors encountered during operation execution. |
| child_operation_ids[] | Output only. Child operation ids. |
ClusterOperationStatus
The status of the operation.
| Field | Description |
| --- | --- |
| state | Output only. A message containing the operation state. |
| inner_state | Output only. A message containing the detailed operation state. |
| details | Output only. A message containing any operation metadata details. |
| state_start_time | Output only. The time this state was entered. |
State
The operation state.
| Enum | Description |
| --- | --- |
| UNKNOWN | Unused. |
| PENDING | The operation has been created. |
| RUNNING | The operation is running. |
| DONE | The operation is done; either cancelled or completed. |
ClusterSelector
A selector that chooses target cluster for jobs based on metadata.
| Field | Description |
| --- | --- |
| zone | Optional. The zone where workflow process executes. This parameter does not affect the selection of the cluster. If unspecified, the zone of the first cluster matching the selector is used. |
| cluster_labels | Required. The cluster labels. Cluster must have all labels to match. |
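An illustrative WorkflowTemplatePlacement fragment that selects an existing cluster by labels instead of creating a managed cluster. The label key and value are placeholders.

```python
# Illustrative only; the label key/value are placeholders.
placement = {
    "cluster_selector": {
        # A cluster must carry all listed labels to match.
        "cluster_labels": {"env": "staging"},
        # "zone" may be omitted; it does not affect cluster selection.
    }
}
```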
ClusterStatus
The status of a cluster and its instances.
| Field | Description |
| --- | --- |
| state | Output only. The cluster's state. |
| detail | Optional. Output only. Details of cluster's state. |
| state_start_time | Output only. Time when this state was entered (see JSON representation of Timestamp). |
| substate | Output only. Additional state information that includes status reported by the agent. |
State
The cluster state.
| Enum | Description |
| --- | --- |
| UNKNOWN | The cluster state is unknown. |
| CREATING | The cluster is being created and set up. It is not ready for use. |
| RUNNING | The cluster is currently running and healthy. It is ready for use. Note: The cluster state changes from "creating" to "running" status after the master node(s), first two primary worker nodes (and the last primary worker node if primary workers > 2) are running. |
| ERROR | The cluster encountered an error. It is not ready for use. |
| ERROR_DUE_TO_UPDATE | The cluster has encountered an error while being updated. Jobs can be submitted to the cluster, but the cluster cannot be updated. |
| DELETING | The cluster is being deleted. It cannot be used. |
| UPDATING | The cluster is being updated. It continues to accept and process jobs. |
| STOPPING | The cluster is being stopped. It cannot be used. |
| STOPPED | The cluster is currently stopped. It is not ready for use. |
| STARTING | The cluster is being started. It is not ready for use. |
Substate
The cluster substate.
| Enum | Description |
| --- | --- |
| UNSPECIFIED | The cluster substate is unknown. |
| UNHEALTHY | The cluster is known to be in an unhealthy state (for example, critical daemons are not running or HDFS capacity is exhausted). Applies to RUNNING state. |
| STALE_STATUS | The agent-reported status is out of date (may occur if Dataproc loses communication with Agent). Applies to RUNNING state. |
Component
Cluster components that can be activated.
| Enum | Description |
| --- | --- |
| COMPONENT_UNSPECIFIED | Unspecified component. Specifying this will cause cluster creation to fail. |
| ANACONDA | The Anaconda python distribution. The Anaconda component is not supported in the Dataproc 2.0 image. The 2.0 image is pre-installed with Miniconda. |
| DOCKER | Docker. |
| DRUID | The Druid query engine. (alpha) |
| FLINK | Flink. |
| HBASE | HBase. (beta) |
| HIVE_WEBHCAT | The Hive Web HCatalog (the REST service for accessing HCatalog). |
| HUDI | Hudi. |
| JUPYTER | The Jupyter Notebook. |
| PRESTO | The Presto query engine. |
| RANGER | The Ranger service. |
| SOLR | The Solr service. |
| ZEPPELIN | The Zeppelin notebook. |
| ZOOKEEPER | The Zookeeper service. |
ConfidentialInstanceConfig
Confidential Instance Config for clusters using Confidential VMs.

| Field | Description |
| --- | --- |
| enable_confidential_compute | Optional. Defines whether the instance should have confidential compute enabled. |
CreateAutoscalingPolicyRequest
A request to create an autoscaling policy.
| Field | Description |
| --- | --- |
| parent | Required. The "resource name" of the region or location, as described in https://cloud.google.com/apis/design/resource_names. Authorization requires an IAM permission on the specified resource. |
| policy | Required. The autoscaling policy to create. |
CreateClusterRequest
A request to create a cluster.
| Field | Description |
| --- | --- |
| project_id | Required. The ID of the Google Cloud Platform project that the cluster belongs to. Authorization requires an IAM permission on the specified resource. |
| region | Required. The Dataproc region in which to handle the request. |
| cluster | Required. The cluster to create. |
| request_id | Optional. A unique ID used to identify the request. If the server receives two CreateClusterRequests with the same id, the second request is ignored and the result of the first request is returned. It is recommended to always set this value to a UUID. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), and hyphens (-). The maximum length is 40 characters. |
| action_on_failed_primary_workers | Optional. Failure action when primary worker creation fails. |
CreateWorkflowTemplateRequest
A request to create a workflow template.
| Field | Description |
| --- | --- |
| parent | Required. The resource name of the region or location, as described in https://cloud.google.com/apis/design/resource_names. Authorization requires an IAM permission on the specified resource. |
| template | Required. The Dataproc workflow template to create. |
DataprocMetricConfig
Dataproc metric config.
| Field | Description |
| --- | --- |
| metrics[] | Required. Metrics sources to enable. |
Metric
A Dataproc custom metric.
| Field | Description |
| --- | --- |
| metric_source | Required. A standard set of metrics is collected unless metric_overrides are specified for the metric source. |
| metric_overrides[] | Optional. Specify one or more custom metrics to collect for the metric source, using camelCase as appropriate. When overrides are provided, only the specified metrics are collected for that metric source. |
MetricSource
A source for the collection of Dataproc custom metrics (see Custom metrics).
| Enum | Description |
| --- | --- |
| METRIC_SOURCE_UNSPECIFIED | Required unspecified metric source. |
| MONITORING_AGENT_DEFAULTS | Monitoring agent metrics. If this source is enabled, Dataproc enables the monitoring agent in Compute Engine, and collects monitoring agent metrics, which are published with an agent.googleapis.com prefix. |
| HDFS | HDFS metric source. |
| SPARK | Spark metric source. |
| YARN | YARN metric source. |
| SPARK_HISTORY_SERVER | Spark History Server metric source. |
| HIVESERVER2 | Hiveserver2 metric source. |
| HIVEMETASTORE | Hive Metastore metric source. |
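An illustrative dataproc_metric_config for a ClusterConfig: it enables the monitoring agent defaults plus the Spark metric source. No metric_overrides are given, so the standard metric set for each source is collected.

```python
# Illustrative only; enable two of the metric sources listed above.
dataproc_metric_config = {
    "metrics": [
        {"metric_source": "MONITORING_AGENT_DEFAULTS"},
        {"metric_source": "SPARK"},
    ]
}
```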
DeleteAutoscalingPolicyRequest
A request to delete an autoscaling policy.
Autoscaling policies in use by one or more clusters will not be deleted.
| Field | Description |
| --- | --- |
| name | Required. The "resource name" of the autoscaling policy, as described in https://cloud.google.com/apis/design/resource_names. Authorization requires an IAM permission on the specified resource. |
DeleteClusterRequest
A request to delete a cluster.
| Field | Description |
| --- | --- |
| project_id | Required. The ID of the Google Cloud Platform project that the cluster belongs to. |
| region | Required. The Dataproc region in which to handle the request. |
| cluster_name | Required. The cluster name. Authorization requires an IAM permission on the specified resource. |
| cluster_uuid | Optional. Specifying the cluster_uuid means the RPC should fail (with error NOT_FOUND) if a cluster with the specified UUID does not exist. |
| request_id | Optional. A unique ID used to identify the request. If the server receives two DeleteClusterRequests with the same id, the second request is ignored and the result of the first request is returned. It is recommended to always set this value to a UUID. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), and hyphens (-). The maximum length is 40 characters. |
DeleteJobRequest
A request to delete a job.
| Field | Description |
| --- | --- |
| project_id | Required. The ID of the Google Cloud Platform project that the job belongs to. |
| region | Required. The Dataproc region in which to handle the request. |
| job_id | Required. The job ID. Authorization requires an IAM permission on the specified resource. |
DeleteWorkflowTemplateRequest
A request to delete a workflow template.
Currently started workflows will remain running.
| Field | Description |
| --- | --- |
| name | Required. The resource name of the workflow template, as described in https://cloud.google.com/apis/design/resource_names. Authorization requires an IAM permission on the specified resource. |
| version | Optional. The version of workflow template to delete. If specified, will only delete the template if the current server version matches specified version. |
DiagnoseClusterRequest
A request to collect cluster diagnostic information.
| Field | Description |
| --- | --- |
| project_id | Required. The ID of the Google Cloud Platform project that the cluster belongs to. |
| region | Required. The Dataproc region in which to handle the request. |
| cluster_name | Required. The cluster name. Authorization requires an IAM permission on the specified resource. |
DiagnoseClusterResults
The location of diagnostic output.
| Field | Description |
| --- | --- |
| output_uri | Output only. The Cloud Storage URI of the diagnostic output. The output report is a plain text file with a summary of collected diagnostics. |
DiskConfig
Specifies the config of disk options for a group of VM instances.
| Field | Description |
| --- | --- |
| boot_disk_type | Optional. Type of the boot disk (default is "pd-standard"). Valid values: "pd-balanced" (Persistent Disk Balanced Solid State Drive), "pd-ssd" (Persistent Disk Solid State Drive), or "pd-standard" (Persistent Disk Hard Disk Drive). See Disk types. |
| boot_disk_size_gb | Optional. Size in GB of the boot disk (default is 500GB). |
| num_local_ssds | Optional. Number of attached SSDs, from 0 to 8 (default is 0). If SSDs are not attached, the boot disk is used to store runtime logs and HDFS data. If one or more SSDs are attached, this runtime bulk data is spread across them, and the boot disk contains only basic config and installed binaries. Note: Local SSD options may vary by machine type and number of vCPUs selected. |
| local_ssd_interface | Optional. Interface type of local SSDs (default is "scsi"). Valid values: "scsi" (Small Computer System Interface), "nvme" (Non-Volatile Memory Express). See local SSD performance. |
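An illustrative disk_config for an InstanceGroupConfig; the size and SSD count are placeholders chosen within the documented bounds.

```python
# Illustrative only; values are placeholders within the documented ranges.
disk_config = {
    "boot_disk_type": "pd-ssd",
    "boot_disk_size_gb": 500,
    "num_local_ssds": 2,            # 0-8; bulk runtime data is spread across SSDs
    "local_ssd_interface": "nvme",
}
```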
DriverSchedulingConfig
Driver scheduling configuration.
| Field | Description |
| --- | --- |
| memory_mb | Required. The amount of memory in MB the driver is requesting. |
| vcores | Required. The number of vCPUs the driver is requesting. |
EncryptionConfig
Encryption settings for the cluster.
| Field | Description |
| --- | --- |
| gce_pd_kms_key_name | Optional. The Cloud KMS key name to use for PD disk encryption for all instances in the cluster. |
EndpointConfig
Endpoint config for this cluster.

| Field | Description |
| --- | --- |
| http_ports | Output only. The map of port descriptions to URLs. Will only be populated if enable_http_port_access is true. |
| enable_http_port_access | Optional. If true, enable http access to specific ports on the cluster from external sources. Defaults to false. |
FailureAction
Actions in response to failure of a resource associated with a cluster.
| Enum | Description |
| --- | --- |
| FAILURE_ACTION_UNSPECIFIED | When FailureAction is unspecified, failure action defaults to NO_ACTION. |
| NO_ACTION | Take no action on failure to create a cluster resource. NO_ACTION is the default. |
| DELETE | Delete the failed cluster resource. |
GceClusterConfig
Common config settings for resources of Compute Engine cluster instances, applicable to all instances in the cluster.
| Field | Description |
| --- | --- |
| zone_uri | Optional. The Compute Engine zone where the Dataproc cluster will be located. If omitted, the service will pick a zone in the cluster's Compute Engine region. On a get request, zone will always be present. A full URL, partial URI, or short name are valid. |
| network_uri | Optional. The Compute Engine network to be used for machine communications. Cannot be specified with subnetwork_uri. If neither network_uri nor subnetwork_uri is specified, the "default" network of the project is used, if it exists. A full URL, partial URI, or short name are valid. |
| subnetwork_uri | Optional. The Compute Engine subnetwork to be used for machine communications. Cannot be specified with network_uri. A full URL, partial URI, or short name are valid. |
| private_ipv6_google_access | Optional. The type of IPv6 access for a cluster. |
| service_account | Optional. The Dataproc service account (also see VM Data Plane identity) used by Dataproc cluster VM instances to access Google Cloud Platform services. If not specified, the Compute Engine default service account is used. |
| service_account_scopes[] | Optional. The URIs of service account scopes to be included in Compute Engine instances. A base set of scopes is always included; if no scopes are specified, additional default scopes are also provided. |
| tags[] | The Compute Engine tags to add to all instances (see Tagging instances). |
| metadata | The Compute Engine metadata entries to add to all instances (see Project and instance metadata). |
| reservation_affinity | Optional. Reservation Affinity for consuming Zonal reservation. |
| node_group_affinity | Optional. Node Group Affinity for sole-tenant clusters. |
| shielded_instance_config | Optional. Shielded Instance Config for clusters using Compute Engine Shielded VMs. |
| confidential_instance_config | Optional. Confidential Instance Config for clusters using Confidential VMs. |
| internal_ip_only | Optional. If true, all instances in the cluster will only have internal IP addresses. By default, clusters are not restricted to internal IP addresses, and will have ephemeral external IP addresses assigned to each instance. When this restriction is enabled, all off-cluster dependencies must be configured to be accessible without external IP addresses. |
PrivateIpv6GoogleAccess
PrivateIpv6GoogleAccess controls whether and how Dataproc cluster nodes can communicate with Google Services through gRPC over IPv6. These values are directly mapped to corresponding values in the Compute Engine Instance fields.

| Enum | Description |
| --- | --- |
| PRIVATE_IPV6_GOOGLE_ACCESS_UNSPECIFIED | If unspecified, Compute Engine default behavior will apply, which is the same as INHERIT_FROM_SUBNETWORK. |
| INHERIT_FROM_SUBNETWORK | Private access to and from Google Services configuration inherited from the subnetwork configuration. This is the default Compute Engine behavior. |
| OUTBOUND | Enables outbound private IPv6 access to Google Services from the Dataproc cluster. |
| BIDIRECTIONAL | Enables bidirectional private IPv6 access between Google Services and the Dataproc cluster. |
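An illustrative gce_cluster_config pulling several of the fields above together. The subnetwork, service account, tag, and metadata entry are placeholders; internal_ip_only keeps all instances on internal IP addresses.

```python
# Illustrative only; the resource names below are placeholders.
gce_cluster_config = {
    "subnetwork_uri": "projects/my-project/regions/us-central1/subnetworks/my-subnet",
    "internal_ip_only": True,
    "service_account": "dataproc-vm@my-project.iam.gserviceaccount.com",
    "tags": ["dataproc-cluster"],
    "private_ipv6_google_access": "INHERIT_FROM_SUBNETWORK",
    "metadata": {"enable-oslogin": "true"},   # placeholder metadata entry
}
```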
GetAutoscalingPolicyRequest
A request to fetch an autoscaling policy.
| Field | Description |
| --- | --- |
| name | Required. The "resource name" of the autoscaling policy, as described in https://cloud.google.com/apis/design/resource_names. Authorization requires an IAM permission on the specified resource. |
GetClusterRequest
Request to get the resource representation for a cluster in a project.
| Field | Description |
| --- | --- |
| project_id | Required. The ID of the Google Cloud Platform project that the cluster belongs to. |
| region | Required. The Dataproc region in which to handle the request. |
| cluster_name | Required. The cluster name. Authorization requires an IAM permission on the specified resource. |
GetJobRequest
A request to get the resource representation for a job in a project.
| Field | Description |
| --- | --- |
| project_id | Required. The ID of the Google Cloud Platform project that the job belongs to. |
| region | Required. The Dataproc region in which to handle the request. |
| job_id | Required. The job ID. Authorization requires an IAM permission on the specified resource. |
GetNodeGroupRequest
A request to get a node group.

| Field | Description |
| --- | --- |
| name | Required. The name of the node group to retrieve. |
GetWorkflowTemplateRequest
A request to fetch a workflow template.
| Field | Description |
| --- | --- |
| name | Required. The resource name of the workflow template, as described in https://cloud.google.com/apis/design/resource_names. Authorization requires an IAM permission on the specified resource. |
| version | Optional. The version of workflow template to retrieve. Only previously instantiated versions can be retrieved. If unspecified, retrieves the current version. |
GkeClusterConfig
The cluster's GKE config.
| Field | Description |
| --- | --- |
| namespaced_gke_deployment_target | Optional. Deprecated. Use gkeClusterTarget. Used only for the deprecated beta. A target for the deployment. |
| gke_cluster_target | Optional. A target GKE cluster to deploy to. It must be in the same project and region as the Dataproc cluster (the GKE cluster can be zonal or regional). Format: 'projects/{project}/locations/{location}/clusters/{cluster_id}' |
| node_pool_target[] | Optional. GKE node pools where workloads will be scheduled. At least one node pool must be assigned the DEFAULT role. |
NamespacedGkeDeploymentTarget
Deprecated. Used only for the deprecated beta. A full, namespace-isolated deployment target for an existing GKE cluster.
| Field | Description |
| --- | --- |
| target_gke_cluster | Optional. The target GKE cluster to deploy to. Format: 'projects/{project}/locations/{location}/clusters/{cluster_id}' |
| cluster_namespace | Optional. A namespace within the GKE cluster to deploy into. |
GkeNodePoolConfig
The configuration of a GKE node pool used by a Dataproc-on-GKE cluster.
| Field | Description |
| --- | --- |
| config | Optional. The node pool configuration. |
| locations[] | Optional. The list of Compute Engine zones where node pool nodes associated with a Dataproc on GKE virtual cluster will be located. Note: All node pools associated with a virtual cluster must be located in the same region as the virtual cluster, and they must be located in the same zone within that region. If a location is not specified during node pool creation, Dataproc on GKE will choose the zone. |
| autoscaling | Optional. The autoscaler configuration for this node pool. The autoscaler is enabled only when a valid configuration is present. |
GkeNodeConfig
Parameters that describe cluster nodes.
| Field | Description |
| --- | --- |
| machine_type | Optional. The name of a Compute Engine machine type. |
| local_ssd_count | Optional. The number of local SSD disks to attach to the node, which is limited by the maximum number of disks allowable per zone (see Adding Local SSDs). |
| preemptible | Optional. Whether the nodes are created as legacy preemptible VM instances. Also see the spot field. |
| accelerators[] | Optional. A list of hardware accelerators to attach to each node. |
| min_cpu_platform | Optional. Minimum CPU platform to be used by this instance. The instance may be scheduled on the specified or a newer CPU platform. Specify the friendly names of CPU platforms, such as "Intel Haswell" or "Intel Sandy Bridge". |
| spot | Optional. Whether the nodes are created as Spot VM instances. Spot VMs are the latest update to legacy preemptible VMs. Also see the preemptible field. |
GkeNodePoolAcceleratorConfig
A GkeNodePoolAcceleratorConfig represents a Hardware Accelerator request for a node pool.

| Field | Description |
| --- | --- |
| accelerator_count | The number of accelerator cards exposed to an instance. |
| accelerator_type | The accelerator type resource name (see GPUs on Compute Engine). |
| gpu_partition_size | Size of partitions to create on the GPU. Valid values are described in the NVIDIA MIG user guide. |
GkeNodePoolAutoscalingConfig
GkeNodePoolAutoscaling contains information the cluster autoscaler needs to adjust the size of the node pool to the current cluster usage.
| Field | Description |
| --- | --- |
| min_node_count | The minimum number of nodes in the node pool. Must be >= 0 and <= max_node_count. |
| max_node_count | The maximum number of nodes in the node pool. Must be >= min_node_count, and must be > 0. Note: Quota must be sufficient to scale up the cluster. |
GkeNodePoolTarget
GKE node pools that Dataproc workloads run on.
| Field | Description |
| --- | --- |
| node_pool | Required. The target GKE node pool. Format: 'projects/{project}/locations/{location}/clusters/{cluster}/nodePools/{node_pool}' |
| roles[] | Required. The roles associated with the GKE node pool. |
| node_pool_config | Input only. The configuration for the GKE node pool. If specified, Dataproc attempts to create a node pool with the specified shape. If one with the same name already exists, it is verified against all specified fields. If a field differs, the virtual cluster creation will fail. If omitted, any node pool with the specified name is used. If a node pool with the specified name does not exist, Dataproc creates a node pool with default values. This is an input-only field. It will not be returned by the API. |
Role
Role specifies the tasks that will run on the node pool. Roles can be specific to workloads. Exactly one GkeNodePoolTarget within the virtual cluster must have the DEFAULT role, which is used to run all workloads that are not associated with a node pool.

| Enum | Description |
| --- | --- |
| ROLE_UNSPECIFIED | Role is unspecified. |
| DEFAULT | At least one node pool must have the DEFAULT role. Work assigned to a role that is not associated with a node pool is assigned to the node pool with the DEFAULT role. For example, work assigned to the CONTROLLER role will be assigned to the node pool with the DEFAULT role if no node pool has the CONTROLLER role. |
| CONTROLLER | Run work associated with the Dataproc control plane (for example, controllers and webhooks). Very low resource requirements. |
| SPARK_DRIVER | Run work associated with a Spark driver of a job. |
| SPARK_EXECUTOR | Run work associated with a Spark executor of a job. |
HadoopJob
A Dataproc job for running Apache Hadoop MapReduce jobs on Apache Hadoop YARN.
| Field | Description |
| --- | --- |
| args[] | Optional. The arguments to pass to the driver. Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission. |
| jar_file_uris[] | Optional. Jar file URIs to add to the CLASSPATHs of the Hadoop driver and tasks. |
| file_uris[] | Optional. HCFS (Hadoop Compatible Filesystem) URIs of files to be copied to the working directory of Hadoop drivers and distributed tasks. Useful for naively parallel tasks. |
| archive_uris[] | Optional. HCFS URIs of archives to be extracted in the working directory of Hadoop drivers and tasks. Supported file types: .jar, .tar, .tar.gz, .tgz, or .zip. |
| properties | Optional. A mapping of property names to values, used to configure Hadoop. Properties that conflict with values set by the Dataproc API may be overwritten. Can include properties set in /etc/hadoop/conf/*-site and classes in user code. |
| logging_config | Optional. The runtime log config for job execution. |
| Union field driver. Required. Indicates the location of the driver's main class. Specify either the jar file that contains the main class or the main class name. To specify both, add the jar file to jar_file_uris, and then specify the main class name in this property. driver can be only one of the following: | |
| main_jar_file_uri | The HCFS URI of the jar file containing the main class. Examples: 'gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar' 'hdfs:/tmp/test-samples/custom-wordcount.jar' 'file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar' |
| main_class | The name of the driver's main class. The jar file containing the class must be in the default CLASSPATH or specified in jar_file_uris. |
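A hedged sketch of submitting a HadoopJob with the Python client (an assumption). The jar URI is copied from the main_jar_file_uri examples above; the cluster name and the gs:// buckets in the args are placeholders.

```python
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"   # placeholders
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

hadoop_job = {
    # Jar URI taken from the field examples above.
    "main_jar_file_uri": "file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar",
    "args": ["wordcount", "gs://my-bucket/input/", "gs://my-bucket/output/"],   # placeholders
}

job = client.submit_job(
    request={
        "project_id": project_id,
        "region": region,
        "job": {"placement": {"cluster_name": "example-cluster"}, "hadoop_job": hadoop_job},
    }
)
print(job.reference.job_id)
```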
HiveJob
A Dataproc job for running Apache Hive queries on YARN.
| Field | Description |
| --- | --- |
| continue_on_failure | Optional. Whether to continue executing queries if a query fails. The default value is false. Setting to true can be useful when executing independent parallel queries. |
| script_variables | Optional. Mapping of query variable names to values (equivalent to the Hive command: SET name="value";). |
| properties | Optional. A mapping of property names and values, used to configure Hive. Properties that conflict with values set by the Dataproc API may be overwritten. Can include properties set in /etc/hadoop/conf/*-site.xml, /etc/hive/conf/hive-site.xml, and classes in user code. |
| jar_file_uris[] | Optional. HCFS URIs of jar files to add to the CLASSPATH of the Hive server and Hadoop MapReduce (MR) tasks. Can contain Hive SerDes and UDFs. |
| Union field queries. Required. The sequence of Hive queries to execute, specified as either an HCFS file URI or a list of queries. queries can be only one of the following: | |
| query_file_uri | The HCFS URI of the script that contains Hive queries. |
| query_list | A list of queries. |
IdentityConfig
Identity related configuration, including service account based secure multi-tenancy user mappings.
| Field | Description |
| --- | --- |
| user_service_account_mapping | Required. Map of user to service account. |
InstanceGroupAutoscalingPolicyConfig
Configuration for the size bounds of an instance group, including its proportional size to other groups.
| Field | Description |
| --- | --- |
| min_instances | Optional. Minimum number of instances for this group. Primary workers - Bounds: [2, max_instances]. Default: 2. Secondary workers - Bounds: [0, max_instances]. Default: 0. |
| max_instances | Required. Maximum number of instances for this group. Required for primary workers. Note that by default, clusters will not use secondary workers. Required for secondary workers if the minimum secondary instances is set. Primary workers - Bounds: [min_instances, ). Secondary workers - Bounds: [min_instances, ). Default: 0. |
| weight | Optional. Weight for the instance group, which is used to determine the fraction of total workers in the cluster from this instance group. For example, if primary workers have weight 2, and secondary workers have weight 1, the cluster will have approximately 2 primary workers for each secondary worker. The cluster may not reach the specified balance if constrained by min/max bounds or other autoscaling settings. If weight is not set on any instance group, the cluster will default to equal weight for all groups: the cluster will attempt to maintain an equal number of workers in each group within the configured size bounds for each group. If weight is set for one group only, the cluster will default to zero weight on the unset group. For example, if weight is set only on primary workers, the cluster will use primary workers only and no secondary workers. |
InstanceGroupConfig
The config settings for Compute Engine resources in an instance group, such as a master or worker group.
| Field | Description |
| --- | --- |
| num_instances | Optional. The number of VM instances in the instance group. For HA cluster master_config groups, must be set to 3. For standard cluster master_config groups, must be set to 1. |
| instance_names[] | Output only. The list of instance names. Dataproc derives the names from cluster_name, num_instances, and the instance group. |
| image_uri | Optional. The Compute Engine image resource used for cluster instances. The URI can represent an image or image family; for an image family, Dataproc will use the most recent image from the family. If the URI is unspecified, it will be inferred from SoftwareConfig.image_version or the system default. |
| machine_type_uri | Optional. The Compute Engine machine type used for cluster instances. A full URL, partial URI, or short name are valid. Auto Zone Exception: If you are using the Dataproc Auto Zone Placement feature, you must use the short name of the machine type resource. |
| disk_config | Optional. Disk option config settings. |
| is_preemptible | Output only. Specifies that this instance group contains preemptible instances. |
| preemptibility | Optional. Specifies the preemptibility of the instance group. The default value for master and worker groups is NON_PREEMPTIBLE. The default value for secondary instances is PREEMPTIBLE. |
| managed_group_config | Output only. The config for Compute Engine Instance Group Manager that manages this group. This is only used for preemptible instance groups. |
| accelerators[] | Optional. The Compute Engine accelerator configuration for these instances. |
| min_cpu_platform | Optional. Specifies the minimum CPU platform for the instance group. See Dataproc -> Minimum CPU Platform. |
Preemptibility
Controls the use of preemptible instances within the group.
| Enum | Description |
| --- | --- |
| PREEMPTIBILITY_UNSPECIFIED | Preemptibility is unspecified, the system will choose the appropriate setting for each instance group. |
| NON_PREEMPTIBLE | Instances are non-preemptible. This option is allowed for all instance groups and is the only valid value for Master and Worker instance groups. |
| PREEMPTIBLE | Instances are preemptible. This option is allowed only for secondary worker groups. |
| SPOT | Instances are Spot VMs. This option is allowed only for secondary worker groups. Spot VMs are the latest version of preemptible VMs, and provide additional features. |
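Illustrative worker and secondary-worker InstanceGroupConfig payloads. Machine types and counts are placeholders; SPOT preemptibility is only valid for the secondary worker group, per the enum above.

```python
# Illustrative only; machine types and counts are placeholders.
worker_config = {
    "num_instances": 2,
    "machine_type_uri": "n1-standard-4",
    "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 500},
}

secondary_worker_config = {
    "num_instances": 4,
    "preemptibility": "SPOT",           # allowed only for secondary workers
    "machine_type_uri": "n1-standard-4",
}
```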
InstantiateInlineWorkflowTemplateRequest
A request to instantiate an inline workflow template.
| Field | Description |
| --- | --- |
| parent | Required. The resource name of the region or location, as described in https://cloud.google.com/apis/design/resource_names. Authorization requires an IAM permission on the specified resource. |
| template | Required. The workflow template to instantiate. |
| request_id | Optional. A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries. It is recommended to always set this value to a UUID. The tag must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), and hyphens (-). The maximum length is 40 characters. |
InstantiateWorkflowTemplateRequest
A request to instantiate a workflow template.
| Field | Description |
| --- | --- |
| name | Required. The resource name of the workflow template, as described in https://cloud.google.com/apis/design/resource_names. Authorization requires an IAM permission on the specified resource. |
| version | Optional. The version of workflow template to instantiate. If specified, the workflow will be instantiated only if the current version of the workflow template has the supplied version. This option cannot be used to instantiate a previous version of workflow template. |
| request_id | Optional. A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries. It is recommended to always set this value to a UUID. The tag must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), and hyphens (-). The maximum length is 40 characters. |
| parameters | Optional. Map from parameter names to values that should be used for those parameters. Values may not exceed 1000 characters. |
Job
A Dataproc job resource.
| Field | Description |
| --- | --- |
| reference | Optional. The fully qualified reference to the job, which can be used to obtain the equivalent REST path of the job resource. If this property is not specified when a job is created, the server generates a job_id. |
| placement | Required. Job information, including how, when, and where to run the job. |
| status | Output only. The job status. Additional application-specific status information may be contained in the type_job and yarn_applications fields. |
| status_history[] | Output only. The previous job status. |
| yarn_applications[] | Output only. The collection of YARN applications spun up by this job. Beta Feature: This report is available for testing purposes only. It may be changed before final release. |
| driver_output_resource_uri | Output only. A URI pointing to the location of the stdout of the job's driver program. |
| driver_control_files_uri | Output only. If present, the location of miscellaneous control files which may be used as part of job setup and handling. If not present, control files may be placed in the same location as the driver output. |
| labels | Optional. The labels to associate with this job. Label keys must contain 1 to 63 characters, and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters, and must conform to RFC 1035. No more than 32 labels can be associated with a job. |
| scheduling | Optional. Job scheduling configuration. |
| job_uuid | Output only. A UUID that uniquely identifies a job within the project over time. This is in contrast to a user-settable reference.job_id that may be reused over time. |
| done | Output only. Indicates whether the job is completed. If the value is false, the job is still in progress. If true, the job is completed, and status.state will indicate whether it was successful, failed, or cancelled. |
| driver_scheduling_config | Optional. Driver scheduling configuration. |
| Union field type_job. Required. The application/framework-specific portion of the job. type_job can be only one of the following: | |
| hadoop_job | Optional. Job is a Hadoop job. |
| spark_job | Optional. Job is a Spark job. |
| pyspark_job | Optional. Job is a PySpark job. |
| hive_job | Optional. Job is a Hive job. |
| pig_job | Optional. Job is a Pig job. |
| spark_r_job | Optional. Job is a SparkR job. |