Set up gcloud
To configure the Google Cloud CLI for your Cloud Run service:
Set your default project:
gcloud config set project PROJECT_ID
Replace PROJECT_ID with the ID of the project you created for this tutorial. Use that same value everywhere this page references PROJECT_ID.
Configure Google Cloud CLI to use the region europe-west1 for Cloud Run commands:
gcloud config set run/region europe-west1
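If you want to confirm both values before continuing, you can print the active configuration. This optional check is not part of the tutorial steps and only reads your local gcloud settings:
# Optional: confirm the active project and Cloud Run region
gcloud config get-value project
gcloud config get-value run/region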
Use Docker to create a container image with Ollama and Gemma
Create a directory for the Ollama service and change your working directory to this new directory:
mkdir ollama-backend
cd ollama-backend
Create a file named Dockerfile with the following contents:

FROM ollama/ollama:latest

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST 0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS /models

# Reduce logging verbosity
ENV OLLAMA_DEBUG false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1

# Store the model weights in the container image
ENV MODEL gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]
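If you want to validate the Dockerfile before deploying, you can optionally build the image locally. This is only a sketch for local verification and assumes Docker is installed; the build pulls the Gemma weights, so expect a long build and a large image:
# Optional local check: build the image to catch Dockerfile errors early
# (pulls the gemma3:4b weights at build time)
docker build -t ollama-gemma .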
Store model weights in the container image for faster instance starts
Google recommends storing the model weights for Gemma 3 (4B) and similarly sized models directly in the container image.
Model weights are the numerical parameters that define the behavior of an LLM. Ollama must fully read these files and load the weights into GPU memory (VRAM) during container instance startup, before it can start serving inference requests.
On Cloud Run, a fast container instance startup is important for minimizing request latency. If your container instance has a slow startup time, the service takes longer to scale from zero to one instance, and it needs more time to scale out during a traffic spike.
To ensure a fast startup, store the model files in the container image itself. This is faster and more reliable than downloading the files from a remote location during startup. Cloud Run's internal container image storage is optimized for handling traffic spikes, allowing it to quickly set up the container's file system when an instance starts.
Note that the model weights for Gemma 3 (4B) take up 8 GB of storage. Larger models have larger model weight files, and these might be impractical to store in the container image. Refer to Best practices: AI inference on Cloud Run with GPUs for an overview of the trade-offs.
Build and deploy the Cloud Run service
Build and deploy the service to Cloud Run:
gcloud run deploy ollama-gemma \
--source . \
--concurrency 4 \
--cpu 8 \
--set-env-vars OLLAMA_NUM_PARALLEL=4 \
--gpu 1 \
--gpu-type nvidia-l4 \
--max-instances 1 \
--memory 32Gi \
--no-allow-unauthenticated \
--no-cpu-throttling \
--no-gpu-zonal-redundancy \
--timeout=600
Note the following important flags in this command:
--concurrency 4 is set to match the value of the OLLAMA_NUM_PARALLEL environment variable.
--gpu 1 with --gpu-type nvidia-l4 assigns one NVIDIA L4 GPU to every Cloud Run instance in the service.
--max-instances 1 specifies the maximum number of instances to scale to. It must be equal to or lower than your project's NVIDIA L4 GPU quota (Total Nvidia L4 GPU allocation, per project per region).
--no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run's built-in Identity and Access Management (IAM) authentication for service-to-service communication. Refer to Managing access using IAM.
--no-cpu-throttling is required for enabling GPU.
--no-gpu-zonal-redundancy sets the zonal redundancy option. Choose the setting that matches your zonal failover requirements and available quota. See GPU zonal redundancy options for details.
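To confirm that the flags above were applied, you can optionally describe the deployed service. This check is a sketch rather than part of the tutorial; it assumes the service name ollama-gemma and the europe-west1 region configured earlier:
# Optional: inspect the deployed service configuration
gcloud run services describe ollama-gemma --region europe-west1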
Setting concurrency for optimal performance
This section provides context on the recommended concurrency settings. For optimal
request latency, ensure the --concurrency
setting is equal to Ollama's
OLLAMA_NUM_PARALLEL
environment variable.
OLLAMA_NUM_PARALLEL determines how many request slots are available per model to handle inference requests concurrently.
--concurrency determines how many requests Cloud Run sends to an Ollama instance at the same time.
If --concurrency
exceeds OLLAMA_NUM_PARALLEL
, Cloud Run can send
more requests to a model in Ollama than it has available request slots for.
This leads to request queuing within Ollama, increasing request latency for the
queued requests. It also leads to less responsive auto scaling, as the queued
requests don't trigger Cloud Run to scale out and start new instances.
Ollama also supports serving multiple models from one GPU. To completely
avoid request queuing on the Ollama instance, you should still set
--concurrency
to match OLLAMA_NUM_PARALLEL
.
It's important to note that increasing OLLAMA_NUM_PARALLEL
also makes parallel requests take longer.
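If you decide to change OLLAMA_NUM_PARALLEL later, update --concurrency in the same command so the two values stay matched. The following is a sketch with an example value of 8, not a recommendation for your workload:
# Sketch: raise Ollama's parallel request slots and Cloud Run concurrency together
gcloud run services update ollama-gemma \
  --concurrency 8 \
  --update-env-vars OLLAMA_NUM_PARALLEL=8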
Optimizing utilization
For optimal GPU utilization, increase --concurrency
, keeping it within
twice the value of OLLAMA_NUM_PARALLEL
. While this leads to request queuing in Ollama, it can help improve utilization: Ollama instances can immediately process requests from their queue, and the queues help absorb traffic spikes.
Test the deployed Ollama service with curl
Now that you have deployed the Ollama service, you can send requests to it. However,
if you send a request directly, Cloud Run responds with HTTP 401 Unauthorized.
This is intentional, because an LLM inference API is intended for other services to
call, such as a frontend application. For more information on service-to-service
authentication on Cloud Run, refer to Authenticating service-to-service.
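As an optional sanity check before setting up the proxy, you can send one authenticated request directly from your machine by attaching an identity token yourself. This sketch assumes your account has the Cloud Run Invoker role on the service and that you replace SERVICE_URL with the URL printed by the deploy command:
# Sketch: one-off authenticated request using your own identity token (replace SERVICE_URL)
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  SERVICE_URL/api/generate -d '{
    "model": "gemma3:4b",
    "prompt": "Why is the sky blue?"
  }'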
To send requests to the Ollama service, add a header with a valid OIDC token to the requests, for example using the Cloud Run developer proxy:
Start the proxy, and when prompted to install the cloud-run-proxy component, choose Y:
gcloud run services proxy ollama-gemma --port=9090
Send a request to it in a separate terminal tab, leaving the proxy running. Note that the proxy runs on localhost:9090:
curl http://localhost:9090/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Why is the sky blue?"
}'
This command should provide streaming output similar to this:
{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.641492408Z","response":"That","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.687529153Z","response":"'","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.753284927Z","response":"s","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.812957381Z","response":" a","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.889102649Z","response":" fantastic","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.925748116Z","response":",","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.958391572Z","response":" decept","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.971035028Z","response":"ively","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.989678484Z","response":" tricky","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.999321940Z","response":" question","done":false} ...