- Automated Cluster Management
- Managed deployment, logging, and monitoring let you focus on your data, not on your cluster. Your clusters will be stable, scalable, and speedy.
- Resizable Clusters
- Create and scale clusters quickly with a choice of virtual machine types, disk sizes, node counts, and networking options.
- Integrated
- Built-in integration with Cloud Storage, BigQuery, Bigtable, Stackdriver Logging, and Stackdriver Monitoring gives you a complete and robust data platform.
- Versioning
- Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools.
- Highly Available
- Run clusters with multiple master nodes and set jobs to restart on failure to ensure your clusters and jobs are highly available.
- Developer Tools
- Multiple ways to manage a cluster, including an easy-to-use Web UI, the Google Cloud SDK, RESTful APIs, and SSH access.
- Initialization Actions
- Run initialization actions to install or customize the settings and libraries you need when your cluster is created.
- Automatic or Manual Configuration
- Cloud Dataproc automatically configures hardware and software on clusters for you while also allowing for manual control.
- Flexible Virtual Machines
- Clusters can use custom machine types and preemptible virtual machines so they are the perfect size for your needs.
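Several of the features above map onto options of the Google Cloud SDK. The sketch below only assembles command lines for `gcloud dataproc clusters create` and `gcloud dataproc jobs submit pyspark`; it runs nothing. The helper names, cluster name, region, bucket paths, and image version are hypothetical placeholders, and the flag spellings are assumed to match a recent SDK release, so confirm them with `gcloud dataproc clusters create --help` before relying on them.

```python
import shlex


def cluster_create_argv(name, region, *, num_workers=2, high_availability=False,
                        worker_machine_type=None, image_version=None,
                        init_actions=(), num_secondary_workers=0):
    """Return the argv for `gcloud dataproc clusters create`; nothing is executed."""
    argv = ["gcloud", "dataproc", "clusters", "create", name,
            f"--region={region}", f"--num-workers={num_workers}"]
    if high_availability:
        # High-availability mode runs three master nodes instead of one.
        argv.append("--num-masters=3")
    if worker_machine_type:
        # Accepts predefined types or custom ones such as "custom-6-23040".
        argv.append(f"--worker-machine-type={worker_machine_type}")
    if image_version:
        argv.append(f"--image-version={image_version}")
    if init_actions:
        # Comma-separated Cloud Storage URIs of scripts run on each node at creation.
        argv.append("--initialization-actions=" + ",".join(init_actions))
    if num_secondary_workers:
        # Secondary workers are preemptible by default; older SDK releases
        # spell this flag --num-preemptible-workers.
        argv.append(f"--num-secondary-workers={num_secondary_workers}")
    return argv


def restartable_job_argv(cluster, region, main_py, max_failures_per_hour=5):
    """Return the argv for a PySpark job that is restarted when it fails."""
    return ["gcloud", "dataproc", "jobs", "submit", "pyspark", main_py,
            f"--cluster={cluster}", f"--region={region}",
            f"--max-failures-per-hour={max_failures_per_hour}"]


if __name__ == "__main__":
    # All values below are placeholders chosen for illustration.
    create = cluster_create_argv(
        "example-cluster", "us-central1",
        num_workers=4,
        high_availability=True,
        worker_machine_type="custom-6-23040",
        image_version="2.1",
        init_actions=["gs://example-bucket/install-libs.sh"],
        num_secondary_workers=2,
    )
    submit = restartable_job_argv(
        "example-cluster", "us-central1", "gs://example-bucket/job.py")
    print(shlex.join(create))
    print(shlex.join(submit))
```

Building the argument list separately from running it keeps the example easy to inspect; to actually create the cluster, you could pass the list to `subprocess.run` once the SDK is installed and authenticated.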
Cloud Dataflow vs. Cloud Dataproc: Which should you use?
|WORKLOADS|CLOUD DATAPROC|CLOUD DATAFLOW|
|---|---|---|
|Stream processing (ETL)| |✓|
|Batch processing (ETL)|✓|✓|
|Iterative processing and notebooks|✓| |
|Machine learning with Spark ML|✓| |
|Preprocessing for machine learning| |✓ (with Cloud ML Engine)|