Browser and OS automation in Cloud Run

Build automation tools or run a full desktop operating system (OS) in your Cloud Run container so that AI agents can browse the web, extract information from it, and automate actions through mouse clicks and keyboard input.

Build browser tools on Cloud Run

To let your AI agent navigate the web, install Chromium in your Cloud Run container and grant the agent the permissions it needs to access Chromium. Cloud Run has built-in streaming support, which you can use to send browser data back to the agent or the end user.

To build a browser tool on Cloud Run, use one of the following approaches:

Headless Chrome

Automate common browser tasks programmatically with headless Chrome. You can use headless Chrome for the following use cases:

  • Large-scale web scraping and data extraction
  • Form submissions
  • UI testing
  • PDF or screenshot generation for web pages
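
As a rough illustration of the PDF and screenshot use case, the following Python sketch builds a headless Chromium command line and runs it. It assumes a Chromium binary is already installed in the container image; the candidate binary names, the window size, and the helper names (find_chromium, build_capture_command, capture) are hypothetical choices for this example, not something the page prescribes.

```python
import shutil
import subprocess
from typing import List, Optional

# Binary names differ between base images; this list is an assumption.
CANDIDATE_BINARIES = ("chromium", "chromium-browser", "google-chrome")


def find_chromium() -> Optional[str]:
    """Return the path of the first Chromium-like binary on PATH, or None."""
    for name in CANDIDATE_BINARIES:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_capture_command(
    binary: str,
    url: str,
    output_path: str,
    kind: str = "screenshot",
    no_sandbox: bool = False,
) -> List[str]:
    """Build a headless Chrome command that saves a screenshot or a PDF."""
    if kind == "screenshot":
        output_flag = f"--screenshot={output_path}"
    elif kind == "pdf":
        output_flag = f"--print-to-pdf={output_path}"
    else:
        raise ValueError(f"kind must be 'screenshot' or 'pdf', got {kind!r}")

    command = [binary, "--headless", "--window-size=1280,720", output_flag]
    if no_sandbox:
        # Often needed when Chromium runs as root in a container, but it
        # weakens isolation; enabling it is an assumption about your image.
        command.append("--no-sandbox")
    command.append(url)
    return command


def capture(url: str, output_path: str, kind: str = "screenshot",
            timeout_s: float = 60.0) -> None:
    """Run headless Chromium once and write the capture to output_path."""
    binary = find_chromium()
    if binary is None:
        raise RuntimeError("No Chromium binary found on PATH")
    command = build_capture_command(binary, url, output_path, kind)
    # check=True raises CalledProcessError if Chromium exits with an error.
    subprocess.run(command, check=True, timeout=timeout_s)
```

For example, capture("https://example.com", "/tmp/page.pdf", kind="pdf") would write a PDF of the page, provided that Chromium is present in the image. Building the command separately from running it keeps the flag handling easy to inspect and test.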

Implement headless Chrome using one of the following options:

  • High-level API libraries like Puppeteer or Playwright: use these libraries to control a browser. You instruct the browser to visit a website, extract the content, and pass that content to an AI model for summarization or structured data extraction.

  • Chrome DevTools Protocol (CDP): provides the stable API that Chrome DevTools uses. This API exposes all browser features programmatically. The agent controls actions like mouse clicks and retrieves the results as text or as pixel data in the form of a screenshot.
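
To make the Chrome DevTools Protocol option more concrete, the following Python sketch builds the JSON messages an agent might send over the browser's debugging WebSocket to load a page, click a point, and request a screenshot. Page.navigate, Input.dispatchMouseEvent, and Page.captureScreenshot are existing protocol methods; the CdpMessageBuilder class and the helper functions are hypothetical names for this example, and the transport (opening the WebSocket and matching responses to requests) is deliberately left out.

```python
import base64
import json
from typing import Any, Dict, List, Optional


class CdpMessageBuilder:
    """Builds CDP command messages with increasing ids (hypothetical helper)."""

    def __init__(self) -> None:
        self._next_id = 1

    def command(self, method: str, params: Optional[Dict[str, Any]] = None) -> str:
        """Serialize one command; the id lets the caller match its response."""
        message = {"id": self._next_id, "method": method, "params": params or {}}
        self._next_id += 1
        return json.dumps(message)


def click_and_capture(builder: CdpMessageBuilder, url: str, x: int, y: int) -> List[str]:
    """Return the messages for: open url, left-click at (x, y), take a screenshot."""
    mouse = {"x": x, "y": y, "button": "left", "clickCount": 1}
    return [
        builder.command("Page.navigate", {"url": url}),
        # A click is a press followed by a release at the same coordinates.
        builder.command("Input.dispatchMouseEvent", {"type": "mousePressed", **mouse}),
        builder.command("Input.dispatchMouseEvent", {"type": "mouseReleased", **mouse}),
        builder.command("Page.captureScreenshot", {"format": "png"}),
    ]


def decode_screenshot(response: Dict[str, Any]) -> bytes:
    """Extract image bytes from a Page.captureScreenshot response.

    The protocol returns the image as base64 text in result.data.
    """
    return base64.b64decode(response["result"]["data"])
```

In a real agent, each message would be sent in order and the agent would wait for the response with the matching id before continuing, for example waiting for navigation to finish before it clicks.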

Desktop OS with virtual network computing (VNC) streaming

Run a full desktop OS in your Cloud Run container for complex tasks, such as the following:

  • Automate file uploads or downloads
  • Interact with browser extensions or other desktop applications
  • Test complex user journeys that involve drag-and-drop and other intricate mouse movements

This approach lets you run a full desktop OS on Cloud Run and stream the results back through WebSockets.

When you install the standard Chromium browser on this desktop, the agent interacts with the OS as a human would, and then retrieves the desktop's pixel data, for example as a screenshot.
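
As a sketch of how an agent might express a drag-and-drop on such a desktop, the following Python example encodes mouse input as PointerEvent messages from the RFB protocol that VNC uses (RFC 6143). This assumes the agent talks to the VNC server directly; the page does not say which VNC server or input mechanism to use, and the function names pointer_event and drag_events are hypothetical. The connection handshake and the WebSocket transport are left out.

```python
import struct
from typing import List, Tuple

POINTER_EVENT = 5   # RFB client-to-server message type for pointer input
LEFT_BUTTON = 0x01  # bit 0 of the button mask is the left mouse button


def pointer_event(x: int, y: int, button_mask: int) -> bytes:
    """Encode one RFB PointerEvent: type, button mask, x, y (big-endian)."""
    if not (0 <= x <= 0xFFFF and 0 <= y <= 0xFFFF):
        raise ValueError("coordinates must fit in an unsigned 16-bit integer")
    if not 0 <= button_mask <= 0xFF:
        raise ValueError("button_mask must fit in one byte")
    return struct.pack("!BBHH", POINTER_EVENT, button_mask, x, y)


def drag_events(start: Tuple[int, int], end: Tuple[int, int],
                steps: int = 4) -> List[bytes]:
    """Encode a drag: move to start, press, move in steps, release at end."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    x0, y0 = start
    x1, y1 = end
    events = [
        pointer_event(x0, y0, 0),            # hover over the start point
        pointer_event(x0, y0, LEFT_BUTTON),  # press the left button
    ]
    for i in range(1, steps + 1):
        # Linear interpolation; the button stays held while the pointer moves.
        x = x0 + (x1 - x0) * i // steps
        y = y0 + (y1 - y0) * i // steps
        events.append(pointer_event(x, y, LEFT_BUTTON))
    events.append(pointer_event(x1, y1, 0))  # release at the end point
    return events
```

Sending intermediate moves with the button held, instead of jumping straight to the end point, is a design choice in this sketch: many applications only recognize a drag when they see pointer motion between the press and the release.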