Dataproc client libraries

This page shows how to get started with the Cloud Client Libraries for the Dataproc API. However, if you are running on the Google App Engine standard environment, we recommend using the older Google API Client Libraries. Read more about the client libraries for Cloud APIs in Client Libraries Explained.

Dataproc Cloud Client Libraries may be in alpha or beta stage. See the library reference for details.

Installing the client library

C#

For more information, see Setting Up a C# Development Environment.

Also see the Google.Cloud.Dataproc.V1 installation instructions.
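
For example, you can typically add the package with the .NET CLI (this command is a sketch; see the installation instructions above for other options):

dotnet add package Google.Cloud.Dataproc.V1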

Go

For more information, see Setting Up a Go Development Environment.

go get -u cloud.google.com/go/dataproc/apiv1

For more information, see Install the Cloud Client Libraries for Go.

Java

For more information, see Setting Up a Java Development Environment.

If you are using Maven, add this to your pom.xml file:

<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-dataproc</artifactId>
    <version>insert dataproc-library-version here</version>
</dependency>

If you are using Gradle, add this to your dependencies:

compile group: 'com.google.cloud', name: 'google-cloud-dataproc', version: 'insert dataproc-library-version here'

Node.js

For more information, see Setting Up a Node.js Development Environment.

npm install --save @google-cloud/dataproc

PHP

For more information, see Using PHP on Google Cloud.

composer require google/cloud

Python

For more information, see Setting Up a Python Development Environment.

pip install --upgrade google-cloud-dataproc

Ruby

For more information, see Setting Up a Ruby Development Environment.

gem install google-cloud-dataproc

Setting up authentication

To run the client library, you must first set up authentication by creating a service account and setting an environment variable. Complete the following steps to set up authentication. For other ways to authenticate, see the GCP authentication documentation.

Cloud Console

Create a service account:

  1. In the Cloud Console, go to the Create service account page.

    Go to Create service account
  2. Select a project.
  3. In the Service account name field, enter a name. The Cloud Console fills in the Service account ID field based on this name.

    In the Service account description field, enter a description. For example, Service account for quickstart.

  4. Click Create.
  5. Click the Select a role field.

    Under Quick access, click Basic, then click Owner.

  6. Click Continue.
  7. Click Done to finish creating the service account.

    Do not close your browser window. You will use it in the next step.

Create a service account key:

  1. In the Cloud Console, click the email address for the service account that you created.
  2. Click Keys.
  3. Click Add key, then click Create new key.
  4. Click Create. A JSON key file is downloaded to your computer.
  5. Click Close.

Command line

You can run the following commands using the Cloud SDK on your local machine, or in Cloud Shell.

  1. Create the service account. Replace NAME with a name for the service account.

    gcloud iam service-accounts create NAME
  2. Grant permissions to the service account. Replace PROJECT_ID with your project ID.

    gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:NAME@PROJECT_ID.iam.gserviceaccount.com" --role="roles/owner"
  3. Generate the key file. Replace FILE_NAME with a name for the key file.

    gcloud iam service-accounts keys create FILE_NAME.json --iam-account=NAME@PROJECT_ID.iam.gserviceaccount.com

Provide authentication credentials to your application code by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS. This variable only applies to your current shell session, so if you open a new session, set the variable again.

Linux or macOS

export GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"

Replace KEY_PATH with the path of the JSON file that contains your service account key.

For example:

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"

Windows

For PowerShell:

$env:GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"

Replace KEY_PATH with the path of the JSON file that contains your service account key.

For example:

$env:GOOGLE_APPLICATION_CREDENTIALS="C:\Users\username\Downloads\service-account-file.json"

For command prompt:

set GOOGLE_APPLICATION_CREDENTIALS=KEY_PATH

Replace KEY_PATH with the path of the JSON file that contains your service account key.
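
On any platform, once the variable is set, the client libraries pick up the key file automatically through Application Default Credentials. As a quick sanity check (this snippet is illustrative and not part of the official setup steps), you can load the credentials with the google-auth Python package, which is installed as a dependency of google-cloud-dataproc:

import google.auth

# Loads Application Default Credentials. With GOOGLE_APPLICATION_CREDENTIALS
# set, this reads the service account key file you downloaded above.
credentials, project_id = google.auth.default()
print(f"Loaded credentials for project: {project_id}")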

Using the client library

The following example shows how to use the client library.

Go

Before trying this sample, follow the Go setup instructions in the Dataproc quickstart using client libraries. For more information, see the Dataproc Go API reference documentation.

import (
	"context"
	"fmt"
	"io"

	dataproc "cloud.google.com/go/dataproc/apiv1"
	"google.golang.org/api/option"
	dataprocpb "google.golang.org/genproto/googleapis/cloud/dataproc/v1"
)

func createCluster(w io.Writer, projectID, region, clusterName string) error {
	// projectID := "your-project-id"
	// region := "us-central1"
	// clusterName := "your-cluster"
	ctx := context.Background()

	// Create the cluster client.
	endpoint := region + "-dataproc.googleapis.com:443"
	clusterClient, err := dataproc.NewClusterControllerClient(ctx, option.WithEndpoint(endpoint))
	if err != nil {
		return fmt.Errorf("dataproc.NewClusterControllerClient: %v", err)
	}
	defer clusterClient.Close()

	// Create the cluster config.
	req := &dataprocpb.CreateClusterRequest{
		ProjectId: projectID,
		Region:    region,
		Cluster: &dataprocpb.Cluster{
			ProjectId:   projectID,
			ClusterName: clusterName,
			Config: &dataprocpb.ClusterConfig{
				MasterConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   1,
					MachineTypeUri: "n1-standard-2",
				},
				WorkerConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   2,
					MachineTypeUri: "n1-standard-2",
				},
			},
		},
	}

	// Create the cluster.
	op, err := clusterClient.CreateCluster(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateCluster: %v", err)
	}

	resp, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("CreateCluster.Wait: %v", err)
	}

	// Output a success message.
	fmt.Fprintf(w, "Cluster created successfully: %s", resp.ClusterName)
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the Dataproc quickstart using client libraries. For more information, see the Dataproc Java API reference documentation.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.dataproc.v1.Cluster;
import com.google.cloud.dataproc.v1.ClusterConfig;
import com.google.cloud.dataproc.v1.ClusterControllerClient;
import com.google.cloud.dataproc.v1.ClusterControllerSettings;
import com.google.cloud.dataproc.v1.ClusterOperationMetadata;
import com.google.cloud.dataproc.v1.InstanceGroupConfig;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class CreateCluster {

  public static void createCluster() throws IOException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String region = "your-project-region";
    String clusterName = "your-cluster-name";
    createCluster(projectId, region, clusterName);
  }

  public static void createCluster(String projectId, String region, String clusterName)
      throws IOException, InterruptedException {
    String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);

    // Configure the settings for the cluster controller client.
    ClusterControllerSettings clusterControllerSettings =
        ClusterControllerSettings.newBuilder().setEndpoint(myEndpoint).build();

    // Create a cluster controller client with the configured settings. The client only needs to be
    // created once and can be reused for multiple requests. Using a try-with-resources
    // closes the client, but this can also be done manually with the .close() method.
    try (ClusterControllerClient clusterControllerClient =
        ClusterControllerClient.create(clusterControllerSettings)) {
      // Configure the settings for our cluster.
      InstanceGroupConfig masterConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(1)
              .build();
      InstanceGroupConfig workerConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(2)
              .build();
      ClusterConfig clusterConfig =
          ClusterConfig.newBuilder()
              .setMasterConfig(masterConfig)
              .setWorkerConfig(workerConfig)
              .build();
      // Create the cluster object with the desired cluster config.
      Cluster cluster =
          Cluster.newBuilder().setClusterName(clusterName).setConfig(clusterConfig).build();

      // Create the Cloud Dataproc cluster.
      OperationFuture<Cluster, ClusterOperationMetadata> createClusterAsyncRequest =
          clusterControllerClient.createClusterAsync(projectId, region, cluster);
      Cluster response = createClusterAsyncRequest.get();

      // Print out a success message.
      System.out.printf("Cluster created successfully: %s", response.getClusterName());

    } catch (ExecutionException e) {
      System.err.println(String.format("Error executing createCluster: %s ", e.getMessage()));
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Dataproc quickstart using client libraries. For more information, see the Dataproc Node.js API reference documentation.

const dataproc = require('@google-cloud/dataproc');

// TODO(developer): Uncomment and set the following variables
// projectId = 'YOUR_PROJECT_ID'
// region = 'YOUR_CLUSTER_REGION'
// clusterName = 'YOUR_CLUSTER_NAME'

// Create a client with the endpoint set to the desired cluster region
const client = new dataproc.v1.ClusterControllerClient({
  apiEndpoint: `${region}-dataproc.googleapis.com`,
  projectId: projectId,
});

async function createCluster() {
  // Create the cluster config
  const request = {
    projectId: projectId,
    region: region,
    cluster: {
      clusterName: clusterName,
      config: {
        masterConfig: {
          numInstances: 1,
          machineTypeUri: 'n1-standard-2',
        },
        workerConfig: {
          numInstances: 2,
          machineTypeUri: 'n1-standard-2',
        },
      },
    },
  };

  // Create the cluster
  const [operation] = await client.createCluster(request);
  const [response] = await operation.promise();

  // Output a success message
  console.log(`Cluster created successfully: ${response.clusterName}`);
}

createCluster();

Python

Before trying this sample, follow the Python setup instructions in the Dataproc quickstart using client libraries. For more information, see the Dataproc Python API reference documentation.

from google.cloud import dataproc_v1 as dataproc


def create_cluster(project_id, region, cluster_name):
    """This sample walks a user through creating a Cloud Dataproc cluster
       using the Python client library.

       Args:
           project_id (string): Project to use for creating resources.
           region (string): Region where the resources should live.
           cluster_name (string): Name to use for creating a cluster.
    """

    # Create a client with the endpoint set to the desired cluster region.
    cluster_client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    # Output a success message.
    print(f"Cluster created successfully: {result.cluster_name}")

Additional resources