Dataproc Hub makes notebooks easier to use for machine learning
Susheel Kaushik
Product Manager, Data Analytics
Dataproc is a fast, easy-to-use, fully managed cloud service for running open source software, such as Apache Spark, Presto, and Apache Hadoop clusters, in a simpler, more cost-efficient way. Today, with the general availability of Dataproc Hub and the launch of our machine learning initialization action, we are making it easier for data scientists to use IT-governed, open source, notebook-based machine learning with horizontally scalable compute, powered by Spark.
Our enterprise customers running machine learning on Dataproc require role separation between IT and data scientists. With Dataproc Hub, IT administrators can pre-approve and create Dataproc configurations to meet cost and governance constraints. Data scientists can then create personal workspaces backed by IT pre-approved configurations to spin up scalable distributed Dataproc clusters with a single click. Jupyter Notebooks enable data scientists to interactively explore and prepare the data and train their models using Spark and additional OSS machine learning libraries. These on-demand Dataproc clusters can be configured with auto-scaling and auto-deletion policies and can be started and stopped manually or automatically. We have received very positive feedback from our enterprise customers, especially on role separation, and we want to make Dataproc setup even easier with the new machine learning initialization action.
Having worked with enterprises across industries, we have observed common requirements for Dataproc data science configurations that we are now packaging together in our machine learning initialization action. You can further customize the initialization action and add your own libraries to build a custom image. This simplifies Dataproc ML cluster creation while providing data scientists a cluster with:
Python packages such as TensorFlow, PyTorch, MXNet, scikit-learn, and Keras
R packages including XGBoost, Caret, randomForest, and sparklyr
Spark-BigQuery Connector: Spark connector to read and write data from and to BigQuery
Dask and Dask-Yarn: Dask is a Python library for parallel computing with APIs similar to those of the most popular Python data science libraries, such as Pandas, NumPy, and scikit-learn, enabling data scientists to use standard Python tooling at scale. (There's a Dask initialization action available for Dataproc.)
RAPIDS on Spark (optional): RAPIDS Accelerator for Apache Spark combines the power of the RAPIDS cuDF library and the scale of the Spark distributed computing framework. Accelerated shuffle configuration leverages GPU-GPU communications and RDMA capabilities to deliver reduced latency and costs for select ML workloads.
K80, P100, V100, P4, or T4 Nvidia GPUs and drivers (optional)
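As a rough sketch of what creating such a cluster could look like, the Python helper below assembles a gcloud command line that applies the machine learning initialization action and, optionally, attaches GPUs. The script location (mlvm/mlvm.sh in the regional initialization-actions bucket), the metadata keys, and the helper name mlvm_cluster_command are assumptions made for illustration and are not taken from this post, so check them against the current initialization action documentation before relying on them.

```python
import shlex


def mlvm_cluster_command(cluster_name, region, gpu_type=None, gpus_per_node=1):
    """Build a `gcloud dataproc clusters create` argument list (hypothetical helper)."""
    # Assumed location of the ML initialization action script; verify before use.
    init_action = f"gs://goog-dataproc-initialization-actions-{region}/mlvm/mlvm.sh"
    args = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--initialization-actions={init_action}",
        "--optional-components=JUPYTER",  # install the Jupyter notebook component
        "--enable-component-gateway",     # expose the notebook UI through Component Gateway
    ]
    if gpu_type is not None:
        accelerator = f"type={gpu_type},count={gpus_per_node}"
        args += [
            f"--master-accelerator={accelerator}",
            f"--worker-accelerator={accelerator}",
            # Assumed metadata keys that ask the script for GPU drivers and RAPIDS on Spark.
            "--metadata=include-gpus=true,gpu-driver-provider=NVIDIA,rapids-runtime=SPARK",
        ]
    return args


if __name__ == "__main__":
    # Print a copy-pasteable command for a T4-backed cluster (names are placeholders).
    print(shlex.join(mlvm_cluster_command("ml-cluster", "us-central1", "nvidia-tesla-t4")))
```

Returning a list of arguments rather than a single string keeps the command safe to pass to subprocess.run without shell quoting issues.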
Considerations when building a Dataproc cluster for machine learning
Data scientists predominantly infer business events from data events. Then, in collaboration with business owners, they develop hypotheses and build machine learning models to generate actionable insights. The ability to understand how business events translate to data events is a critical factor for success. Our enterprise users need to consider many factors prior to selecting the appropriate Dataproc OSS machine learning environment. Points of consideration include:
Data access: Data scientists need access to long-term historical data to make business event inference and generate actionable insights. Access to data at scale in proximity to the processing environment is critical for large-scale analysis and machine learning.
Dataproc includes predefined open source connectors to access data on Cloud Storage and on BigQuery storage. Using these connectors, Dataproc Spark jobs can seamlessly access data on Cloud Storage in various open source data formats (Avro, Parquet, CSV, and many more) and also data from BigQuery storage in native BigQuery format.
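To make the connector usage concrete, here is a minimal PySpark sketch that reads one DataFrame from BigQuery and one from Parquet files on Cloud Storage. It assumes an existing SparkSession (as provided in a Dataproc notebook) with the Spark-BigQuery connector available on the cluster; the function name and its arguments are hypothetical.

```python
def load_training_inputs(spark, bigquery_table, gcs_parquet_path):
    """Return (BigQuery DataFrame, Cloud Storage DataFrame) using an existing SparkSession.

    Hypothetical helper: `bigquery_table` is a "project.dataset.table" string and
    `gcs_parquet_path` is a gs:// path; both are supplied by the caller.
    """
    # Spark-BigQuery connector: the "bigquery" data source reads the table
    # directly from BigQuery storage.
    events = spark.read.format("bigquery").option("table", bigquery_table).load()

    # Cloud Storage connector: gs:// paths are read like any other Spark file path.
    history = spark.read.parquet(gcs_parquet_path)

    return events, history
```

In a notebook, a call might look like load_training_inputs(spark, "my-project.sales.events", "gs://my-bucket/history/"), where the table and bucket names are placeholders.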
Infrastructure: Data scientists need the flexibility to select the appropriate compute infrastructure for machine learning. This compute infrastructure comprises VM type selection, associated memory, and attached GPUs and TPUs for accelerated processing. The ability to select from a wide range of options is critical for optimizing performance, results, and costs.
Dataproc provides the ability to attach K80, P100, V100, P4, or T4 Nvidia GPUs to Dataproc compute VMs. RAPIDS libraries leverage these GPUs to deliver a performance boost to select Spark workloads.
Processing environment: There are many open source machine learning processing environments, such as Spark ML, Dask, RAPIDS, Python, R, TensorFlow, and more. Data scientists usually have a preference, so we’re focused on enabling as many of the open source processing environments as possible. At the same time, data scientists usually add custom libraries to enhance their data processing and machine learning capabilities.
Dataproc supports the Spark and Dask processing frameworks for running machine learning at scale. Spark ML comes with standard implementations of machine learning algorithms, and you can use them on datasets already stored on Cloud Storage or BigQuery. Some data scientists prefer the ML implementations in Python libraries for building their models; for them, swapping a couple of statements is essentially all it takes to switch from standard Python libraries to Dask. You can select the appropriate processing environment to suit your specific machine learning needs.
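The "swap a couple of statements" idea can be illustrated with a small sketch. The function below runs the same group-by aggregation with either pandas or dask.dataframe, depending on which module is passed in. The function name, the column names region and amount, and the file names in the comments are hypothetical.

```python
def mean_amount_by_region(frame_lib, csv_path):
    """Average the `amount` column per `region` with pandas or dask.dataframe.

    `frame_lib` is the module to use: `pandas` for single-machine work, or
    `dask.dataframe` for the same code running in parallel over partitions.
    """
    df = frame_lib.read_csv(csv_path)
    result = df.groupby("region")["amount"].mean()
    # Dask builds a lazy task graph, so the result must be materialized with
    # compute(); a pandas result is already concrete and has no compute().
    return result.compute() if hasattr(result, "compute") else result


# Switching libraries only changes the import and the module passed in:
#
#   import pandas as pd
#   mean_amount_by_region(pd, "sales.csv")
#
#   import dask.dataframe as dd
#   mean_amount_by_region(dd, "sales-*.csv")   # glob over many files
```

Passing the module as a parameter is just one way to show the swap; in practice most notebooks simply change the import line.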
Orchestration: Many iterations are required in an ML workflow because of model refinement or retuning. Data scientists need a simple approach to automate data processing and machine learning graphs. One design pattern is to build a machine learning pipeline for modeling; another is to schedule the notebook used in interactive modeling.
Dataproc workflow templates enable you to create simple workflows and Cloud Composer can be used to orchestrate complex machine learning pipelines.
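For a simple workflow, the shape of a template can be sketched as plain data. The helper below builds a dictionary that mirrors the field names of the Dataproc WorkflowTemplate resource as we understand them (stepId, pysparkJob, prerequisiteStepIds, managedCluster) and checks that each step only depends on earlier steps. The helper name, the script URIs, and the pass-through cluster_config are hypothetical; treat the exact schema as an assumption to confirm against the API reference.

```python
import json


def build_workflow_template(template_id, cluster_name, cluster_config, steps):
    """Build a workflow template dict from (step_id, pyspark_uri, prerequisites) tuples."""
    seen = set()
    jobs = []
    for step_id, main_python_file_uri, prerequisites in steps:
        # Each step may only depend on steps that were declared before it.
        missing = [p for p in prerequisites if p not in seen]
        if missing:
            raise ValueError(f"step {step_id!r} depends on unknown steps: {missing}")
        job = {
            "stepId": step_id,
            "pysparkJob": {"mainPythonFileUri": main_python_file_uri},
        }
        if prerequisites:
            job["prerequisiteStepIds"] = list(prerequisites)
        jobs.append(job)
        seen.add(step_id)
    return {
        "id": template_id,
        # The workflow runs on a managed cluster created for it; cluster_config
        # is passed through unchanged.
        "placement": {
            "managedCluster": {"clusterName": cluster_name, "config": cluster_config}
        },
        "jobs": jobs,
    }


if __name__ == "__main__":
    template = build_workflow_template(
        "ml-pipeline", "ephemeral-ml", {},
        [
            ("prepare", "gs://my-bucket/jobs/prepare.py", []),
            ("train", "gs://my-bucket/jobs/train.py", ["prepare"]),
        ],
    )
    print(json.dumps(template, indent=2))
```

Here the train step runs only after prepare completes, which covers the common retrain-after-refresh loop. Pipelines with branching, sensors, or cross-service steps are the cases where Cloud Composer is the better fit.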
Metadata management: Dataproc Metastore enables you to store the associated business metadata with the table metadata for easy discovery and communication. Dataproc Metastore, currently in private preview, enables a unified view of open source tables across Google Cloud.
Notebook user experience: Notebooks allow you to interactively run workloads on Dataproc clusters. Data scientists have two options to use Notebooks on Dataproc:
You can use Dataproc Hub to spin up a personal cluster with a Jupyter Notebook experience using IT pre-approved configurations with one click. IT administrators can select the appropriate processing environment (Spark or Dask) and the compute environment (VM type, cores, and memory configuration), and can optionally attach GPU accelerators along with RAPIDS for performance gains on some machine learning workloads. For cost optimization, IT administrators can configure auto-scaling and auto-deletion policies, and data scientists can also manually stop the cluster when it is not in use.
You can configure your own Dataproc cluster, selecting the appropriate processing environment and compute environment along with the notebook experience (Jupyter and Zeppelin) using Component Gateway.
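The cost controls mentioned above map onto a few gcloud options. The sketch below collects the scheduled-deletion and autoscaling flags for a cluster create command and builds the command for manually stopping a cluster. The helper names and the example values are hypothetical, and the autoscaling policy is assumed to exist already.

```python
def cost_control_flags(max_idle=None, max_age=None, autoscaling_policy=None):
    """Return optional flags to append to `gcloud dataproc clusters create`."""
    flags = []
    if max_idle:
        # Delete the cluster after it has been idle this long, e.g. "30m".
        flags.append(f"--max-idle={max_idle}")
    if max_age:
        # Delete the cluster once it reaches this age, e.g. "8h".
        flags.append(f"--max-age={max_age}")
    if autoscaling_policy:
        # ID of an existing autoscaling policy to attach to the cluster.
        flags.append(f"--autoscaling-policy={autoscaling_policy}")
    return flags


def stop_cluster_command(cluster_name, region):
    """Return the gcloud command that manually stops a running cluster."""
    return ["gcloud", "dataproc", "clusters", "stop", cluster_name, f"--region={region}"]


if __name__ == "__main__":
    print(cost_control_flags(max_idle="30m", autoscaling_policy="ml-autoscale"))
    print(stop_cluster_command("ml-cluster", "us-central1"))
```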
Data scientists need a deep understanding of how data represents business transactions and events, along with the ability to leverage innovation in OSS machine learning and deep learning, Notebooks, and Dataproc Hub, to deliver actionable insights. We at Google focus on understanding the complexity and limitations of the underlying frameworks, OSS, and infrastructure, and we are actively working to simplify the OSS machine learning experience so that you can focus more on understanding your business and generating actionable insights, and less on managing the tools used to generate them.
Check out Dataproc, let us know what you think, and help us build the next-generation OSS machine learning experience that is simple, customizable, and easy to use.