Data Analytics

Connect Spark data pipelines to Gemini and other AI models with Dataproc ML library

October 3, 2025

Brad Miro

Developer Advocate

Deependra Patel

Senior Software Engineer

Many data science teams rely on Apache Spark running on Dataproc managed clusters for powerful, large-scale data preparation. As these teams look to connect their data pipelines directly to machine learning models, there's a clear opportunity to simplify the integration. But running inference on a Spark DataFrame using a model from Vertex AI typically requires custom development, making it complex to build a single, end-to-end workflow.

To solve this problem, we are developing a new open-source Python library designed to simplify AI/ML inference for Dataproc. This library connects your Apache Spark jobs to use popular ML frameworks and Vertex AI features, starting with model inference. Because the library is open-sourced, you will be able to use it directly in your application code with full transparency into its operation.

How it works

Dataproc ML is built to feel familiar to Spark users, following a SparkML-style builder pattern. You configure the model you want to use, and then call .transform() on your DataFrame. Let's look at a few common inference use cases.

Apply Gemini models to your Spark data

You can apply generative AI models, like Gemini, to columns in your Spark DataFrame. This is useful for tasks like classification, extraction, or summarization at scale. In this example, we take a DataFrame with "city" & “country” columns and use Gemini to create a new column by providing a simple prompt.

You can test in your local environment by installing from PyPi:

To deploy/test at scale, create a Dataproc version 2.3-ml cluster:

Copy this example to a file gemini_spark.py.

lang-py

from pyspark.sql import SparkSession
from google.cloud.dataproc_ml.inference import GenAiModelHandler

spark = SparkSession.builder.getOrCreate()

# Create a sample DataFrame
df = spark.createDataFrame([
    ("London", "UK"),
    ("Bengaluru", "India"),
    ("Paris", "France"),
    ("Tokyo", "Japan")
], ["city", "country"])

# Configure the model handler. It uses gemini-2.5-flash by default.
genai_handler = GenAiModelHandler().prompt(
    "Write a short, one-line rhyming poem about the experience of visiting {city} in {country}."
)

# Apply the model, which will output to a new `predictions` column
genai_handler.transform(df).show(truncate=False)

The handler is flexible to support customized options, as explained in the documentation.

lang-py

Submit this job to your Dataproc cluster:

2. Run inference with PyTorch and TensorFlow models

In addition to calling Gemini endpoints, the library also allows you to run inference with model files loaded directly from Google Cloud Storage. You can use the PyTorchModelHandler (and a similar handler for TensorFlow) to load your model weights, define a pre-processor, and run inference directly on your worker nodes. This is useful when you want to run batch inference at scale without managing a separate model serving endpoint.

lang-py

Built for performance

This library isn't just a simple wrapper. It’s designed for running inference on large Dataproc clusters and includes several optimizations for inference:

Vectorized data transfer: We use pandas_udf to efficiently move data between Spark and the Python worker processes.
Connection re-use: Connections to the endpoint are re-used across partitions to reduce overhead.
Retry logic: The library automatically handles errors like HTTP 429 (resource exhausted) with exponential backoff and retries.

Get started

You can start using it today by checking out the open-source repository and reading our documentation.

Looking ahead, we plan to add the following features to this library in the coming months.

Spark Connect support: This would also allow using above functionalities within BigQuery Studio notebooks.
Vertex AI integrations: To ease inference, we plan to add more ModelHandlers to:
1. Directly call a vertex model endpoint for online inference
2. Refer to Vertex models and localize them to Spark workers
3. Refer to models hosted in Vertex Model Garden including embedding models
More Optimizations: Auto-repartition input dataframes to enhance inference runtime
Third-party integrations: Refer to open sourced models in HuggingFace

We are actively working on including this library by default in Dataproc on Google Compute Engine ML images and Google Cloud Serverless for Apache Spark runtimes.

We look forward to seeing what you build! Have feedback or feature requests to further simplify your AI/ML experience on spark? Reach us at dataproc-feedback@google.com.

Posted in