What is Cloud Dataproc?
Cloud Dataproc is a fast, easy-to-use, low-cost, and fully managed service that lets you run the Apache Spark and Apache Hadoop ecosystem on Google Cloud Platform. Cloud Dataproc provisions big or small clusters rapidly, supports many popular job types, and is integrated with other Google Cloud Platform services, such as Cloud Storage and Stackdriver Logging, which helps you reduce your total cost of ownership (TCO).
How is Cloud Dataproc different from traditional Hadoop clusters?
Cloud Dataproc is a managed Spark/Hadoop service intended to make Spark and Hadoop easy, fast, and powerful. In a traditional Hadoop deployment, even one that is cloud-based, you must install, configure, administer, and orchestrate work on the cluster. By contrast, Cloud Dataproc handles cluster creation, management, monitoring, and job orchestration for you.
How can I use Cloud Dataproc?
There are a number of ways you can use a Cloud Dataproc cluster depending on your needs and capabilities. You can use the browser-based Google Cloud Platform Console to interact with Cloud Dataproc. Or, because Cloud Dataproc is integrated with the Cloud SDK, you can use the gcloud command-line tool. For programmatic access to clusters, use the Cloud Dataproc REST API. You can also make SSH connections to master or worker nodes in your cluster.
How does Cloud Dataproc work?
Cloud Dataproc is a managed framework that runs on Google Cloud Platform and ties together several popular tools for processing data, including Apache Hadoop, Spark, Hive, and Pig. Cloud Dataproc has a set of control and integration mechanisms that coordinate the lifecycle and management of clusters. Cloud Dataproc is integrated with the YARN application manager to make managing and using your clusters easier.
What type of jobs can I run?
Cloud Dataproc provides out-of-the-box, end-to-end support for many of the most popular job types, including Spark, Spark SQL, PySpark, MapReduce, Hive, and Pig jobs.
What cluster manager does Cloud Dataproc use with Spark?
Cloud Dataproc runs Spark on YARN.
How frequently are the components in Cloud Dataproc updated?
Cloud Dataproc is updated when major releases occur in underlying components (Hadoop, Spark, Hive, Pig). Each major Cloud Dataproc release supports specific versions of each component. For a list of currently supported components and versions, see the [Cloud Dataproc documentation].
Is Cloud Dataproc integrated with other Google Cloud Platform products?
Yes, Cloud Dataproc has native and automatic integrations with Compute Engine, Cloud Storage, Cloud Bigtable, BigQuery, Logging, and Stackdriver Monitoring. Moreover, Cloud Dataproc is integrated into tools that interact with the Cloud Platform including the Cloud SDK and the Google Cloud Platform Console.
Can I run a persistent cluster?
Once started, Cloud Dataproc clusters continue to run until shut down. You can run a Cloud Dataproc cluster for as long as you need.
Can I run more than one cluster at a time?
Yes, you can run more than one Cloud Dataproc cluster per project simultaneously. By default, all projects are subject to Google Cloud resource quotas. You can easily check your quota usage and request an increase to your quota. For more information, see Cloud Dataproc resource quotas.
How can I create or destroy a cluster?
You can create and destroy clusters in several ways. The Cloud Dataproc sections in the Google Cloud Platform Console make it easy to manage clusters from your browser. Clusters can also be managed via the command line through the Cloud SDK. For more complex or advanced use cases, the Cloud Dataproc REST API can be used to programmatically manage clusters.
Can I apply customized settings when I create a cluster?
Cloud Dataproc supports initialization actions that are executed when a cluster is created. These initialization actions can be scripts or executables that Cloud Dataproc will run when provisioning your cluster to customize settings, install applications, or make other modifications to your cluster.
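As a rough illustration of the answer above, the following Python sketch assembles a gcloud dataproc clusters create command that passes initialization actions. It only builds and prints the command; it does not run it. The cluster name, region, and the gs://example-bucket/install-tools.sh script are hypothetical placeholders, and the helper function build_create_command is not part of any Google library.

```python
import shlex


def build_create_command(cluster, region, num_workers, init_action_uris=()):
    """Return the argv list for creating a cluster, optionally with initialization actions."""
    argv = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]
    if init_action_uris:
        # Several scripts are passed as a single comma-separated flag value.
        argv.append("--initialization-actions=" + ",".join(init_action_uris))
    return argv


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    command = build_create_command(
        "example-cluster",
        "us-central1",
        2,
        ["gs://example-bucket/install-tools.sh"],
    )
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems if it is later handed to subprocess.run.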
How do I size a cluster for my needs?
Cluster sizing decisions are influenced by several factors, including the type of work to be performed, cost constraints, speed requirements, and your resource quota. Since Cloud Dataproc can be deployed on a variety of machine types, you have the flexibility to choose the resources you need, when you need them.
Can I resize my cluster?
Yes, you can easily resize your cluster, even during job processing. You can resize your cluster through the Google Cloud Platform Console or through the command line. Resizing can increase or decrease the number of workers in a cluster. Workers added to a cluster will be the same type and size as existing workers. Resizing clusters is acceptable and supported except in special cases, such as reducing the number of workers to one or reducing HDFS capacity below the amount needed for job completion.
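The special cases mentioned above can be expressed as a simple pre-flight check before you shrink a cluster. This is only a sketch under stated assumptions: the function check_resize and its parameters are hypothetical, and the model assumes that each worker contributes a fixed amount of usable HDFS capacity, which you would need to measure for your own cluster.

```python
def check_resize(target_workers, hdfs_needed_gb, hdfs_per_worker_gb):
    """Return warnings for a proposed worker count; an empty list means no known issue.

    Assumption: usable HDFS capacity scales linearly with the number of workers.
    """
    if target_workers < 1:
        raise ValueError("target_workers must be at least 1")

    warnings = []
    if target_workers == 1:
        # Reducing the cluster to a single worker is one of the special cases.
        warnings.append("reducing the cluster to one worker is a special case")
    if target_workers * hdfs_per_worker_gb < hdfs_needed_gb:
        # Shrinking HDFS below what running jobs need is the other special case.
        warnings.append("HDFS capacity would fall below the amount needed for job completion")
    return warnings


if __name__ == "__main__":
    for problem in check_resize(target_workers=1, hdfs_needed_gb=300, hdfs_per_worker_gb=100):
        print("warning:", problem)
```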
Job and workflow management
How can I submit jobs on my cluster?
There are several ways to submit jobs on a Cloud Dataproc cluster. The easiest way is to use the Cloud Dataproc Submit a job page on the Google Cloud Platform Console or the Cloud SDK gcloud dataproc jobs submit command. For programmatic job submission, see the Cloud Dataproc API reference.
Can I run more than one job at a time?
Yes, you can run more than one job at a time on a Cloud Dataproc cluster. Cloud Dataproc utilizes a resource manager (YARN) and application-specific configurations, such as scaling with Spark, to optimize the use of resources on a cluster. Job performance will scale with cluster size and the number of active jobs.
Can I cancel jobs on my cluster?
Definitely. Jobs can be canceled via the Google Cloud Platform Console web interface or the command line. Cloud Dataproc utilizes YARN application cancellation to stop jobs upon request.
Can I automate jobs on my cluster?
Jobs can be automated to run on clusters through several mechanisms. You can use the Cloud SDK gcloud command-line tool or the Cloud Dataproc REST APIs to automate the management and workflow of clusters and jobs.
What development languages are supported?
You can use languages supported by the Spark/Hadoop ecosystem, including Java, Scala, Python, and R.
Does Cloud Dataproc have an API?
Yes, Cloud Dataproc has a set of RESTful APIs that allow you to programmatically interact with clusters and jobs.
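To give a feel for the REST interface, the sketch below builds the URL and JSON body for submitting a PySpark job, following the v1 jobs:submit method as we understand it. It does not send the request, and authentication is left out entirely. The project, region, cluster, and file names are hypothetical, as is the helper build_submit_request; check the field names against the Cloud Dataproc API reference before relying on them.

```python
import json


def build_submit_request(project_id, region, cluster, main_py_uri, args=()):
    """Return (url, body) for a PySpark job submission; nothing is sent over the network."""
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/regions/{region}/jobs:submit"
    )
    body = {
        "job": {
            # The cluster that should run the job.
            "placement": {"clusterName": cluster},
            # The main Python file, typically stored in Cloud Storage.
            "pysparkJob": {
                "mainPythonFileUri": main_py_uri,
                "args": list(args),
            },
        }
    }
    return url, body


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    url, body = build_submit_request(
        "example-project",
        "us-central1",
        "example-cluster",
        "gs://example-bucket/wordcount.py",
        ["gs://example-bucket/input.txt"],
    )
    print("POST", url)
    print(json.dumps(body, indent=2))
```

In practice you would send this body with an authorized HTTP client or use one of the Google Cloud client libraries, which wrap the same method.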
Can I SSH into a cluster?
Yes, you can SSH into every machine (master or worker node) within a cluster. You can SSH from a browser or from the command line.
Can I access the Spark/Hadoop Web UIs?
Yes, the Spark, Hadoop, and YARN web UIs are accessible within a cluster. Rather than opening ports for the UIs, we recommend using an SSH tunnel, which securely forwards traffic from the cluster over the SSH connection.
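One common way to set up such a tunnel is to open a SOCKS proxy through gcloud compute ssh. The sketch below only assembles the command. It assumes that the master node is named after the cluster with an -m suffix, which holds for a default single-master cluster but may not match yours. The cluster name, zone, port, and the helper build_tunnel_command are hypothetical.

```python
import shlex


def build_tunnel_command(cluster, zone, local_port=1080):
    """Return the argv list that opens a SOCKS proxy to the cluster's master node."""
    # Assumption: the master of a single-master cluster is named "<cluster>-m".
    master = f"{cluster}-m"
    return [
        "gcloud", "compute", "ssh", master,
        f"--zone={zone}",
        "--",                    # everything after this is passed to ssh itself
        "-D", str(local_port),   # dynamic (SOCKS) port forwarding on the local port
        "-N",                    # do not run a remote command, only forward traffic
    ]


if __name__ == "__main__":
    print(shlex.join(build_tunnel_command("example-cluster", "us-central1-a")))
```

With the tunnel open, a browser configured to use the local SOCKS proxy can reach the web UIs by the master's hostname without any ports being exposed.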
Can I install or manage software on my cluster?
Yes, as with a Hadoop cluster or server, you can install and manage software on a Cloud Dataproc cluster.
What is the default replication factor?
Due to performance considerations as well as the high reliability of storage attached to Cloud Dataproc clusters, the default replication factor is set at 2.
What operating system (OS) is used for Cloud Dataproc?
Cloud Dataproc is based on Debian. The latest images are based on Debian 9 Stretch.
Where can I learn about Hadoop streaming?
You can review the Apache project documentation.
How do I install the gcloud dataproc command?
When you install the Cloud SDK, the standard gcloud command-line tool is installed, including the gcloud dataproc commands.
Data access & availability
How can I get data in and out of a cluster?
Cloud Dataproc utilizes the Hadoop Distributed File System (HDFS) for storage. Additionally, Cloud Dataproc automatically installs the HDFS-compatible Google Cloud Storage connector, which enables the use of Cloud Storage in parallel with HDFS. Data can be moved in and out of a cluster through upload/download to HDFS or Cloud Storage.
Can I use Cloud Storage with Dataproc?
Yes, Cloud Dataproc clusters automatically install the Cloud Storage connector. There are a number of benefits to choosing Cloud Storage over traditional HDFS including data persistence, reliability, and performance.
Can I get Cloud Storage Connector support?
Yes, when used with Cloud Dataproc, the Cloud Storage connector is supported at the same level as Cloud Dataproc (see Getting support). All connector users can use the google-cloud-dataproc tag on Stack Overflow for connector questions and answers.
How reliable is Cloud Dataproc?
Because Cloud Dataproc is built on reliable and proven Google Cloud Platform technologies, including Compute Engine, Cloud Storage, and Monitoring, it is designed for high availability and reliability. Cloud Dataproc is a generally available product, and you can review the Cloud Dataproc SLA.
What happens to my data when a cluster is shut down?
Any data in Cloud Storage persists after your cluster is shut down. This is one of the reasons to choose Cloud Storage over HDFS since HDFS data is removed when a cluster is shut down (unless it is transferred to a persistent location prior to shutdown).
Logging, monitoring, & debugging
What sort of logging and monitoring is available?
By default, Cloud Dataproc clusters are integrated with Monitoring and Logging. Monitoring and Logging make it easy to get detailed information about the health, performance, and status of your Cloud Dataproc clusters. Both application (YARN, Spark, etc.) and system logs are forwarded to Logging.
How can I view logs from Cloud Dataproc?
You can view logs from Cloud Dataproc in several ways. You can visit Logging to view aggregated cluster logs in a web browser. Additionally, you can use the command-line (SSH) to manually view logs or monitor application outputs. Finally, details are also available via the Hadoop application web UIs, such as the YARN web interface.
How can clusters be monitored?
Clusters can be easily monitored through Monitoring or the Cloud Dataproc section of the Google Cloud Platform Console. You can also monitor your clusters through command-line (SSH) access or the application (Spark, YARN, etc.) web interfaces.
Security & access
How is my data secured?
Google Cloud Platform employs a rich security model, which also applies to Cloud Dataproc. Cloud Dataproc provides authentication, authorization, and encryption mechanisms, such as SSL, to secure data. Data can be user-encrypted in transit to and from a cluster, upon cluster creation or job submission.
How can I control access to my Cloud Dataproc cluster?
Google Cloud Platform offers authentication mechanisms, which can be used with Cloud Dataproc. Access to Cloud Dataproc clusters and jobs can be granted to users at the project level.
How is Cloud Dataproc billed?
Cloud Dataproc is billed by the second, and is based on the size of a cluster and the length of time the cluster is operational. In computing the cluster component of the fee, Cloud Dataproc charges a flat fee based on the number of virtual CPUs (vCPUs) in a cluster. This flat fee is the same regardless of the machine type or size of the Compute Engine resources used. Billing for Cloud Dataproc does not include the charges for Compute Engine or other Cloud resources used with a cluster.
Am I charged for other Google Cloud resources?
Yes, running a Cloud Dataproc cluster incurs charges for other Google Cloud resources used in the cluster, such as Compute Engine and Cloud Storage. Each item is stated separately in your bill, so you know exactly how your costs are calculated and allocated.
Is there a minimum or maximum time for billing?
Google Cloud charges are calculated by the second, not by the hour. Currently, Compute Engine has a 1-minute minimum billing increment. Therefore, Cloud Dataproc also has a 1-minute minimum billing increment.
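The billing rules above (a flat per-vCPU fee, per-second billing, and a 1-minute minimum) reduce to a short calculation. The sketch below covers only the Cloud Dataproc fee, not Compute Engine or other resources. The function dataproc_fee is hypothetical, and the rate in the example is a placeholder, not a quoted price; see the pricing page for current rates.

```python
def dataproc_fee(vcpus, seconds_running, rate_per_vcpu_hour):
    """Estimate the Cloud Dataproc fee for one cluster.

    vcpus: total virtual CPUs in the cluster.
    seconds_running: how long the cluster was operational, in seconds.
    rate_per_vcpu_hour: flat fee per vCPU per hour (supplied by the caller).
    """
    # Billing is per second, but never for less than one minute.
    billable_seconds = max(seconds_running, 60)
    return vcpus * (billable_seconds / 3600) * rate_per_vcpu_hour


if __name__ == "__main__":
    # Placeholder rate for illustration only; not an official price.
    example_rate = 0.01
    fee = dataproc_fee(vcpus=24, seconds_running=2 * 3600, rate_per_vcpu_hour=example_rate)
    print(f"Estimated Cloud Dataproc fee: {fee:.2f}")
```

Because the fee depends only on the vCPU count, a cluster of many small machines and a cluster of a few large machines with the same total vCPUs incur the same Cloud Dataproc fee for the same running time.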
Who can create a Cloud Dataproc cluster?
Cloud Dataproc is generally available, which means all Google Cloud Platform customers can use it.
In which regions is Cloud Dataproc available?
Cloud Dataproc is available across all regions and zones of Google Cloud Platform.