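With the Gemini 2.5 Computer Use model and tool, your applications can interact with a browser and automate tasks in it. Working from screenshots, the Computer Use model infers what is on the screen and acts by generating specific UI actions such as mouse clicks and keyboard input. As with function calling, you write the client-side application code that receives the function calls from the Computer Use model and tool and executes the corresponding actions.
With the Computer Use model and tool, you can build agents that:
- Automate repetitive data entry or form filling on websites.
- Browse websites to gather information.
- Assist users by performing sequences of actions in web applications.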
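This guide assumes you are using the Gen AI SDK for Python and are familiar with the Playwright API.
During the preview, the Computer Use model and tool is not supported in other SDK languages or in the Google Cloud console.
You can also view a reference implementation of the Computer Use model and tool on GitHub.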
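How the Computer Use model and tool works
Instead of generating a text response, the Computer Use model and tool decides when to perform specific UI actions, such as a mouse click, and returns the parameters needed to execute them. You write the client-side application code that receives the `function_call` from the Computer Use model and tool and executes the corresponding action.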
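An interaction with the Computer Use model and tool follows an agent loop:
1. Send a request to the model
   - Add the Computer Use model and tool, plus any other optional tools, to your API request.
   - Prompt the Computer Use model and tool with the user's request and a screenshot that represents the current state of the GUI.
2. Receive the model response
   - The model analyzes the user request and the screenshot, and generates a response containing a suggested `function_call` that represents a UI action (for example, "click at coordinates (x, y)" or "type 'text'"). For a list of all actions you can use with the model, see Supported actions.
   - The API response may also include a `safety_decision` from an internal safety system that has checked the model's proposed action. The `safety_decision` classifies the action as:
     - Regular or allowed: the action is considered safe. This can also be indicated by the absence of a `safety_decision`.
     - Requires confirmation: the model is about to perform an action that may be risky (for example, clicking "accept" on a cookie banner).
3. Execute the received action
   - Your client code receives the `function_call` and any accompanying `safety_decision`.
   - If the `safety_decision` indicates regular or allowed (or if no `safety_decision` is present), your client code can execute the specified `function_call` in the target environment (such as a web browser).
   - If the `safety_decision` indicates requires confirmation, your application must prompt the end user for confirmation before executing the `function_call`. If the user confirms, proceed with the action. If the user declines, don't execute the action.
4. Capture the new environment state
   - If the action was executed, your client captures a new screenshot of the GUI along with the current URL and sends them back to the Computer Use model and tool as part of a `function_response`.
   - If an action was blocked by the safety system or declined by the user, your application might send a different form of feedback to the model or end the interaction.
A new request containing the updated state is then sent to the model. The process repeats from step 2, with the Computer Use model and tool using the new screenshot (if provided) and the ongoing goal to suggest the next action. The loop continues until the task is complete, an error occurs, or the process is terminated (for example, if a response is blocked by safety filters or by a user decision).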
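The decision made in step 3 can be sketched as a small routing function. This is a minimal illustration, not part of the SDK: it assumes the function call has already been converted to a plain dictionary, that the safety decision arrives under `args["safety_decision"]` as in the response examples later in this guide, and that `route_action` and `ask_user` are hypothetical names.

```python
from typing import Any, Callable, Dict


def route_action(
    function_call: Dict[str, Any],
    ask_user: Callable[[str], bool],
) -> str:
    """Return "execute" or "skip" for one proposed UI action.

    function_call: plain-dict form of a proposed action (assumed shape).
    ask_user: hypothetical callback that shows the explanation to the end
        user and returns True only if the user confirms.
    """
    safety = function_call.get("args", {}).get("safety_decision")
    if safety is None:
        # No safety decision attached: treated as regular / allowed.
        return "execute"
    if safety.get("decision") == "require_confirmation":
        # The end user must confirm before the action is executed.
        confirmed = ask_user(safety.get("explanation", ""))
        return "execute" if confirmed else "skip"
    # Unrecognized decision values are skipped as a conservative default.
    return "skip"
```

A caller would execute the action only when the result is `"execute"`, and otherwise send other feedback to the model or end the interaction.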
The following diagram shows how the Computer Use model and tool works:
Enable the Computer Use model and tool
To enable the Computer Use model and tool, use `gemini-2.5-computer-use-preview-10-2025` as the model and add the Computer Use model and tool to the list of enabled tools:
Python
from google import genai
from google.genai import types
from google.genai.types import Content, Part, FunctionResponse

client = genai.Client()

# Add the Computer Use model and tool to the list of tools
generate_content_config = genai.types.GenerateContentConfig(
    tools=[
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
            )
        ),
    ]
)

# Example request using the Computer Use model and tool
contents = [
    Content(
        role="user",
        parts=[
            Part(text="Go to google.com and search for 'weather in New York'"),
        ],
    )
]

response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",
    contents=contents,
    config=generate_content_config,
)
Send a request
After configuring the Computer Use model and tool, send a prompt to the model that includes the user's goal and an initial screenshot of the GUI.
You can also optionally add the following:
- Excluded actions: if there are actions from the list of supported UI actions that you don't want the model to take, specify them in `excluded_predefined_functions`.
- User-defined functions: in addition to the Computer Use model and tool, you might want to include custom user-defined functions.
The following sample code enables the Computer Use model and tool and sends the request to the model:
Python
from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

# Specify predefined functions to exclude (optional)
excluded_functions = ["drag_and_drop"]

# Configuration for the Computer Use model and tool with browser environment
generate_content_config = genai.types.GenerateContentConfig(
    tools=[
        # 1. Computer Use model and tool with browser environment
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
                # Optional: Exclude specific predefined functions
                excluded_predefined_functions=excluded_functions
            )
        ),
        # 2. Optional: Custom user-defined functions (need to be defined above)
        # types.Tool(
        #     function_declarations=custom_functions
        # )
    ],
)

# Create the content with user message
contents: list[Content] = [
    Content(
        role="user",
        parts=[
            Part(text="Search for highly rated smart fridges with touchscreen, 2 doors, around 25 cu ft, priced below 4000 dollars on Google Shopping. Create a bulleted list of the 3 cheapest options in the format of name, description, price in an easy-to-read layout."),
            # Optional: include a screenshot of the initial state
            # Part.from_bytes(
            #     data=screenshot_image_bytes,
            #     mime_type='image/png',
            # ),
        ],
    )
]

# Generate content with the configured settings
response = client.models.generate_content(
    model='gemini-2.5-computer-use-preview-10-2025',
    contents=contents,
    config=generate_content_config,
)

# Print the response output
print(response.text)
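You can also add custom user-defined functions to extend the model's capabilities. To learn how to configure Computer Use for mobile use cases by adding actions such as `open_app`, `long_press_at`, and `go_home` while excluding browser-specific actions, see Computer Use model and tool for mobile use cases.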
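Receive the response
If the model determines that UI actions or user-defined functions are needed to complete the task, it returns one or more `FunctionCalls`. Your application code needs to parse these actions, execute them, and collect the results. The Computer Use model and tool supports parallel function calling, which means the model can return multiple independent actions in a single turn.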
{
  "content": {
    "parts": [
      {
        "text": "I will type the search query into the search bar. The search bar is in the center of the page."
      },
      {
        "function_call": {
          "name": "type_text_at",
          "args": {
            "x": 371,
            "y": 470,
            "text": "highly rated smart fridges with touchscreen, 2 doors, around 25 cu ft, priced below 4000 dollars on Google Shopping",
            "press_enter": true
          }
        }
      }
    ]
  }
}
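A response shaped like the JSON above can be separated into the model's reasoning text and its proposed actions. The sketch below works on a plain dictionary with that shape rather than on SDK objects; `split_parts` is a hypothetical helper name, and because a turn may contain several independent actions it collects a list of function calls.

```python
from typing import Any, Dict, List, Tuple


def split_parts(candidate: Dict[str, Any]) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Separate text parts from function_call parts in one candidate.

    Assumes the dictionary layout shown in the JSON example above:
    {"content": {"parts": [{"text": ...}, {"function_call": {...}}]}}.
    """
    texts: List[str] = []
    calls: List[Dict[str, Any]] = []
    for part in candidate.get("content", {}).get("parts", []):
        if "function_call" in part:
            calls.append(part["function_call"])
        elif "text" in part:
            texts.append(part["text"])
    return texts, calls
```

Keeping the calls in their original order lets the client execute parallel calls in the sequence the model returned them.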
Depending on the action, the API response may also return a `safety_decision`:
{
  "content": {
    "parts": [
      {
        "text": "I have evaluated step 2. It seems Google detected unusual traffic and is asking me to verify I'm not a robot. I need to click the 'I'm not a robot' checkbox located near the top left (y=98, x=95)."
      },
      {
        "function_call": {
          "name": "click_at",
          "args": {
            "x": 60,
            "y": 100,
            "safety_decision": {
              "explanation": "I have encountered a CAPTCHA challenge that requires interaction. I need you to complete the challenge by clicking the 'I'm not a robot' checkbox and any subsequent verification steps.",
              "decision": "require_confirmation"
            }
          }
        }
      }
    ]
  }
}
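In the example above, `safety_decision` travels inside `args` next to the action's own parameters. One way to keep the two apart before dispatching the action is sketched below; `split_safety_decision` is a hypothetical helper operating on a plain dictionary copy of the arguments.

```python
from typing import Any, Dict, Optional, Tuple


def split_safety_decision(
    args: Dict[str, Any],
) -> Tuple[Dict[str, Any], Optional[Dict[str, Any]]]:
    """Return (action_args, safety_decision) without mutating the input.

    action_args holds only the parameters the UI action itself needs;
    safety_decision is None when the response carried no decision.
    """
    action_args = dict(args)  # shallow copy so the caller's dict is untouched
    safety_decision = action_args.pop("safety_decision", None)
    return action_args, safety_decision
```

With the example response, this yields `{"x": 60, "y": 100}` as the click arguments and the `require_confirmation` decision as a separate value to show to the user.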
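Execute the received actions
After receiving a response, your client code needs to execute the received actions.
The following code extracts function calls from the Gemini response, converts coordinates from the 0-1000 range to actual pixels, executes browser actions with Playwright, and returns a success or failure status for each action: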
import time
from typing import Any, List, Tuple


def normalize_x(x: int, screen_width: int) -> int:
    """Convert normalized x coordinate (0-1000) to actual pixel coordinate."""
    return int(x / 1000 * screen_width)


def normalize_y(y: int, screen_height: int) -> int:
    """Convert normalized y coordinate (0-1000) to actual pixel coordinate."""
    return int(y / 1000 * screen_height)


def execute_function_calls(response, page, screen_width: int, screen_height: int) -> List[Tuple[str, Any]]:
    """
    Extract and execute function calls from Gemini response.

    Args:
        response: Gemini API response object
        page: Playwright page object
        screen_width: Screen width in pixels
        screen_height: Screen height in pixels

    Returns:
        List of tuples: [(function_name, result), ...]
    """
    # Extract function calls and thoughts from the model's response
    candidate = response.candidates[0]
    function_calls = []
    thoughts = []
    for part in candidate.content.parts:
        if hasattr(part, 'function_call') and part.function_call:
            function_calls.append(part.function_call)
        elif hasattr(part, 'text') and part.text:
            thoughts.append(part.text)
    if thoughts:
        print(f"Model Reasoning: {' '.join(thoughts)}")

    # Execute each function call
    results = []
    for function_call in function_calls:
        result = None
        try:
            if function_call.name == "open_web_browser":
                print("Executing open_web_browser")
                # Browser is already open via Playwright, so this is a no-op
                result = "success"
            elif function_call.name == "click_at":
                actual_x = normalize_x(function_call.args["x"], screen_width)
                actual_y = normalize_y(function_call.args["y"], screen_height)
                print(f"Executing click_at: ({actual_x}, {actual_y})")
                page.mouse.click(actual_x, actual_y)
                result = "success"
            elif function_call.name == "type_text_at":
                actual_x = normalize_x(function_call.args["x"], screen_width)
                actual_y = normalize_y(function_call.args["y"], screen_height)
                text = function_call.args["text"]
                press_enter = function_call.args.get("press_enter", False)
                clear_before_typing = function_call.args.get("clear_before_typing", True)
                print(f"Executing type_text_at: ({actual_x}, {actual_y}) text='{text}'")
                # Click at the specified location
                page.mouse.click(actual_x, actual_y)
                time.sleep(0.1)
                # Clear existing text if requested
                if clear_before_typing:
                    page.keyboard.press("Control+A")
                    page.keyboard.press("Backspace")
                # Type the text
                page.keyboard.type(text)
                # Press enter if requested
                if press_enter:
                    page.keyboard.press("Enter")
                result = "success"
            else:
                # For any functions not parsed above
                print(f"Unrecognized function: {function_call.name}")
                result = "unknown_function"
        except Exception as e:
            print(f"Error executing {function_call.name}: {e}")
            result = f"error: {str(e)}"
        results.append((function_call.name, result))
    return results
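As a worked example of the coordinate conversion, the `type_text_at` call shown earlier (x=371, y=470) on a hypothetical 1440x900 viewport maps to pixel (534, 423). The sketch below is a variant of the helpers above, not the guide's own code: `to_pixels` is a hypothetical name, and it uses integer arithmetic because the floating-point form `int(x / 1000 * size)` can land one pixel short for some inputs (for example 290 on a 100-pixel axis).

```python
def to_pixels(normalized: int, size: int) -> int:
    """Scale a coordinate on the 1000-unit grid to a pixel offset.

    normalized: value in the range 0-999 returned by the model.
    size: viewport width (for x) or height (for y) in pixels.
    Integer multiplication followed by floor division avoids float rounding.
    """
    if not 0 <= normalized <= 999:
        raise ValueError(f"coordinate out of range: {normalized}")
    return normalized * size // 1000


# Worked example with an assumed 1440x900 viewport.
pixel_x = to_pixels(371, 1440)  # 371 * 1440 = 534240, floor-divided by 1000
pixel_y = to_pixels(470, 900)   # 470 * 900 = 423000, floor-divided by 1000
print(pixel_x, pixel_y)
```

Because the largest accepted input is 999, the result is always strictly less than `size`, so it stays inside the viewport.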
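If the returned `safety_decision` is `require_confirmation`, you must ask the user to confirm before proceeding with the action. Under the terms of service, you are not allowed to bypass requests for human confirmation.
The following code adds safety logic to the previous code: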
import termcolor


def get_safety_confirmation(safety_decision):
    """Prompt user for confirmation when safety check is triggered."""
    termcolor.cprint("Safety service requires explicit confirmation!", color="red")
    print(safety_decision["explanation"])
    decision = ""
    while decision.lower() not in ("y", "n", "ye", "yes", "no"):
        decision = input("Do you wish to proceed? [Y]es/[N]o\n")
    if decision.lower() in ("n", "no"):
        return "TERMINATE"
    return "CONTINUE"


def execute_function_calls(response, page, screen_width: int, screen_height: int):
    # ... Extract function calls from response ...
    for function_call in function_calls:
        extra_fr_fields = {}
        # Check for safety decision
        if 'safety_decision' in function_call.args:
            decision = get_safety_confirmation(function_call.args['safety_decision'])
            if decision == "TERMINATE":
                print("Terminating agent loop")
                break
            extra_fr_fields["safety_acknowledgement"] = "true"
        # ... Execute function call and append to results ...
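Capture the new state
After executing the actions, send the results of the function execution back to the model so that it can use this information to generate the next action. If multiple actions were executed (parallel calls), you must send one `FunctionResponse` for each action in the subsequent user turn. For user-defined functions, the `FunctionResponse` should contain the return value of the executed function.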
function_response_parts = []
for name, result in results:
    # Take screenshot after each action
    screenshot = page.screenshot()
    current_url = page.url
    function_response_parts.append(
        # Wrap each FunctionResponse in a Part so it can be placed in Content.parts
        Part(
            function_response=FunctionResponse(
                name=name,
                # Add the safety acknowledgement here if the user confirmed a safety decision
                response={"url": current_url},
                parts=[
                    types.FunctionResponsePart(
                        inline_data=types.FunctionResponseBlob(
                            mime_type="image/png", data=screenshot
                        )
                    )
                ],
            )
        )
    )

# Create the user feedback content with all responses
user_feedback_content = Content(
    role="user",
    parts=function_response_parts
)

# Append this feedback to the 'contents' history list for the next API call
contents.append(user_feedback_content)
Build an agent loop
Combine the previous steps into a loop to enable multi-step interactions. The loop must handle parallel function calls. Make sure to manage the conversation history (the contents array) correctly by appending both model responses and your function responses.
Python
import time

from google import genai
from google.genai.types import Content, Part
from playwright.sync_api import sync_playwright


def has_function_calls(response):
    """Check if response contains any function calls."""
    candidate = response.candidates[0]
    return any(hasattr(part, 'function_call') and part.function_call
               for part in candidate.content.parts)


def main():
    client = genai.Client()
    # ... (config setup from "Send a request to model" section) ...
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        screen_width, screen_height = 1920, 1080
        # Match the viewport to the dimensions used for coordinate scaling
        page = browser.new_page(viewport={"width": screen_width, "height": screen_height})
        page.goto("https://www.google.com")
        # ... (initial contents setup from "Send a request to model" section) ...

        # Agent loop: iterate until model provides final answer
        for iteration in range(10):
            print(f"\nIteration {iteration + 1}\n")

            # 1. Send request to model (see "Send a request to model" section)
            response = client.models.generate_content(
                model='gemini-2.5-computer-use-preview-10-2025',
                contents=contents,
                config=generate_content_config,
            )
            contents.append(response.candidates[0].content)

            # 2. Check if done - no function calls means final answer
            if not has_function_calls(response):
                print(f"FINAL RESPONSE:\n{response.text}")
                break

            # 3. Execute actions (see "Execute the received actions" section)
            results = execute_function_calls(response, page, screen_width, screen_height)
            time.sleep(1)

            # 4. Capture state and create feedback (see "Capture the New State" section)
            # create_feedback is assumed to wrap the code from that section and return a Content
            contents.append(create_feedback(results, page))

        input("\nPress Enter to close browser...")
        browser.close()


if __name__ == "__main__":
    main()
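The control flow of the loop can also be looked at in isolation from the SDK and the browser. The following sketch is an illustration under stated assumptions: `run_agent_loop`, `send`, `extract_calls`, and `execute` are hypothetical names, and the three callables stand in for the model request, the function-call extraction, and the execute-and-capture steps described above.

```python
from typing import Any, Callable, List, Optional


def run_agent_loop(
    send: Callable[[List[Any]], Any],
    extract_calls: Callable[[Any], List[Any]],
    execute: Callable[[List[Any]], Any],
    history: List[Any],
    max_iterations: int = 10,
) -> Optional[Any]:
    """Run the request / execute / feedback cycle with an iteration cap.

    send: takes the full history and returns the model response.
    extract_calls: returns the function calls found in a response.
    execute: runs all calls of one turn and returns one feedback item.
    Returns the final response, or None if the cap was reached first.
    """
    for _ in range(max_iterations):
        response = send(history)
        history.append(response)          # keep the model turn in the history
        calls = extract_calls(response)
        if not calls:
            return response               # no function calls: final answer
        history.append(execute(calls))    # one feedback turn per model turn
    return None
```

Injecting the three steps as callables makes it possible to exercise the history bookkeeping and the stop conditions with fakes before connecting a real model or browser.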
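Computer Use model and tool for mobile use cases
The following example demonstrates how to define custom functions (such as `open_app`, `long_press_at`, and `go_home`), combine them with Gemini's built-in Computer Use tool, and exclude unnecessary browser-specific functions. By registering these custom functions, the model can call them alongside the standard UI actions to complete tasks in non-browser environments.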
from typing import Optional, Dict, Any

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()


def open_app(app_name: str, intent: Optional[str] = None) -> Dict[str, Any]:
    """Opens an app by name.

    Args:
        app_name: Name of the app to open (any string).
        intent: Optional deep-link or action to pass when launching, if the app supports it.

    Returns:
        JSON payload acknowledging the request (app name and optional intent).
    """
    return {"status": "requested_open", "app_name": app_name, "intent": intent}


def long_press_at(x: int, y: int, duration_ms: int = 500) -> Dict[str, int]:
    """Long-press at a specific screen coordinate.

    Args:
        x: X coordinate (absolute), scaled to the device screen width (pixels).
        y: Y coordinate (absolute), scaled to the device screen height (pixels).
        duration_ms: Press duration in milliseconds. Defaults to 500.

    Returns:
        Object with the coordinates pressed and the duration used.
    """
    return {"x": x, "y": y, "duration_ms": duration_ms}


def go_home() -> Dict[str, str]:
    """Navigates to the device home screen.

    Returns:
        A small acknowledgment payload.
    """
    return {"status": "home_requested"}


# Build function declarations
CUSTOM_FUNCTION_DECLARATIONS = [
    types.FunctionDeclaration.from_callable(client=client, callable=open_app),
    types.FunctionDeclaration.from_callable(client=client, callable=long_press_at),
    types.FunctionDeclaration.from_callable(client=client, callable=go_home),
]

# Exclude browser functions
EXCLUDED_PREDEFINED_FUNCTIONS = [
    "open_web_browser",
    "search",
    "navigate",
    "hover_at",
    "scroll_document",
    "go_forward",
    "key_combination",
    "drag_and_drop",
]


# Utility function to construct a GenerateContentConfig
def make_generate_content_config() -> genai.types.GenerateContentConfig:
    """Return a fixed GenerateContentConfig with Computer Use + custom functions."""
    return genai.types.GenerateContentConfig(
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                    excluded_predefined_functions=EXCLUDED_PREDEFINED_FUNCTIONS,
                )
            ),
            types.Tool(function_declarations=CUSTOM_FUNCTION_DECLARATIONS),
        ]
    )


# Create the content with user message
contents: list[Content] = [
    Content(
        role="user",
        parts=[
            # text instruction
            Part(text="Open Chrome, then long-press at 200,400."),
            # optional screenshot attachment
            # (screenshot_image_bytes: PNG bytes of the current screen, captured by your own code)
            Part.from_bytes(
                data=screenshot_image_bytes,
                mime_type="image/png",
            ),
        ],
    )
]

# Build your fixed config (from helper)
config = make_generate_content_config()

# Generate content with the configured settings
response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",
    contents=contents,
    config=config,  # use the config built by the helper above
)
print(response)
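When the model calls one of the custom functions, the client still has to route the call to the matching Python callable and return its result in the `FunctionResponse`. A minimal sketch of that routing follows. It assumes the call has been reduced to a name and a plain argument dictionary; the three handlers are simplified stand-ins for the functions defined above, and `dispatch_custom_call` is a hypothetical name.

```python
from typing import Any, Callable, Dict, Optional


def open_app(app_name: str, intent: Optional[str] = None) -> Dict[str, Any]:
    return {"status": "requested_open", "app_name": app_name, "intent": intent}


def long_press_at(x: int, y: int, duration_ms: int = 500) -> Dict[str, int]:
    return {"x": x, "y": y, "duration_ms": duration_ms}


def go_home() -> Dict[str, str]:
    return {"status": "home_requested"}


# Registry mapping the declared function names to local callables.
CUSTOM_HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "open_app": open_app,
    "long_press_at": long_press_at,
    "go_home": go_home,
}


def dispatch_custom_call(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Run the handler registered for `name` and return its payload.

    Unknown names and mismatched arguments are reported as an error payload
    instead of raising, so the result can still be sent back to the model.
    """
    handler = CUSTOM_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown function: {name}"}
    try:
        return handler(**args)
    except TypeError as exc:
        return {"error": str(exc)}
```

Returning an error payload rather than raising is a design choice in this sketch: it gives the model a chance to correct its arguments on the next turn.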
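Supported actions
With the Computer Use model and tool, the model can request the following actions by using a `FunctionCall`. Your client-side code must implement the execution logic for these actions. For examples, see the reference implementation.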
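Command name | Description | Arguments (in function call) | Example function call |
---|---|---|---|
open_web_browser | Opens the web browser. | None | {"name": "open_web_browser", "args": {}} |
wait_5_seconds | Pauses execution for 5 seconds so that dynamic content can load or animations can finish. | None | {"name": "wait_5_seconds", "args": {}} |
go_back | Navigates to the previous page in the browser's history. | None | {"name": "go_back", "args": {}} |
go_forward | Navigates to the next page in the browser's history. | None | {"name": "go_forward", "args": {}} |
search | Navigates to the home page of the default search engine (for example, Google). Useful for starting a new search task. | None | {"name": "search", "args": {}} |
navigate | Navigates the browser directly to the specified URL. | url: str | {"name": "navigate", "args": {"url": "https://www.wikipedia.org"}} |
click_at | Clicks the element at specific coordinates on the web page. The x and y values are based on a 1000x1000 grid and are scaled to the screen dimensions. | y: int (0-999), x: int (0-999) | {"name": "click_at", "args": {"y": 300, "x": 500}} |
hover_at | Hovers the mouse at specific coordinates on the web page. Useful for revealing submenus. x and y are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999) | {"name": "hover_at", "args": {"y": 150, "x": 250}} |
type_text_at | Types text at specific coordinates. By default it clears the field first and presses Enter after typing, but both behaviors can be disabled. x and y are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999), text: str, press_enter: bool (optional, default True), clear_before_typing: bool (optional, default True) | {"name": "type_text_at", "args": {"y": 250, "x": 400, "text": "search query", "press_enter": false}} |
key_combination | Presses keyboard keys or key combinations, such as "Ctrl+C" or "Enter". Useful for triggering actions (such as submitting a form with "Enter") or clipboard operations. | keys: str (for example, "enter", "control+c"). For the full list of allowed keys, see the API reference. | {"name": "key_combination", "args": {"keys": "Control+A"}} |
scroll_document | Scrolls the entire web page up, down, left, or right. | direction: str ("up", "down", "left", or "right") | {"name": "scroll_document", "args": {"direction": "down"}} |
scroll_at | Scrolls the specific element or area at coordinates (x, y) in the given direction by a given magnitude. The coordinates and the magnitude (default 800) are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999), direction: str ("up", "down", "left", "right"), magnitude: int (0-999, optional, default 800) | {"name": "scroll_at", "args": {"y": 500, "x": 500, "direction": "down", "magnitude": 400}} |
drag_and_drop | Drags an element from the starting coordinates (x, y) and drops it at the destination coordinates (destination_x, destination_y). All coordinates are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999), destination_y: int (0-999), destination_x: int (0-999) | {"name": "drag_and_drop", "args": {"y": 100, "x": 100, "destination_y": 500, "destination_x": 500}} |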
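The execution code earlier in this guide handles only `open_web_browser`, `click_at`, and `type_text_at`. A few more rows of the table can be sketched as follows, assuming `page` is a Playwright `Page` (only `goto`, `go_back`, `go_forward`, `mouse.move`, and `mouse.wheel` are used). The function name `execute_simple_action`, the fixed `SCROLL_PIXELS` distance, and mapping `scroll_document` onto a mouse-wheel event are assumptions of this sketch, not behavior defined by the API.

```python
import time
from typing import Any, Callable, Dict

SCROLL_PIXELS = 600  # assumed scroll distance per scroll_document call

# Unit direction vectors (dx, dy) for the four scroll directions.
SCROLL_DIRECTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def execute_simple_action(
    page: Any,
    name: str,
    args: Dict[str, Any],
    screen_width: int,
    screen_height: int,
    sleep: Callable[[float], Any] = time.sleep,
) -> str:
    """Execute one of several simple actions; return a status string."""
    if name == "navigate":
        page.goto(args["url"])
    elif name == "go_back":
        page.go_back()
    elif name == "go_forward":
        page.go_forward()
    elif name == "wait_5_seconds":
        sleep(5)
    elif name == "hover_at":
        # Scale from the 1000-unit grid to viewport pixels.
        page.mouse.move(args["x"] * screen_width // 1000,
                        args["y"] * screen_height // 1000)
    elif name == "scroll_document":
        dx, dy = SCROLL_DIRECTIONS[args["direction"]]
        page.mouse.wheel(dx * SCROLL_PIXELS, dy * SCROLL_PIXELS)
    else:
        return "unknown_function"
    return "success"
```

The `sleep` parameter exists only so the 5-second wait can be replaced in tests. Actions that need key-name mapping or multi-step mouse handling, such as `key_combination` and `drag_and_drop`, are left out of this sketch.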
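Safety
This section describes the safeguards that the Computer Use model and tool has in place to improve user control and safety. It also describes best practices for mitigating potential new risks that the tool might introduce.
Acknowledge safety decisions
Depending on the action, the response from the Computer Use model and tool might include a `safety_decision` from an internal safety system. This decision verifies, for safety, the action proposed by the tool.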
{
  "content": {
    "parts": [
      {
        "text": "I have evaluated step 2. It seems Google detected unusual traffic and is asking me to verify I'm not a robot. I need to click the 'I'm not a robot' checkbox located near the top left (y=98, x=95)."
      },
      {
        "function_call": {
          "name": "click_at",
          "args": {
            "x": 60,
            "y": 100,
            "safety_decision": {
              "explanation": "I have encountered a CAPTCHA challenge that requires interaction. I need you to complete the challenge by clicking the 'I'm not a robot' checkbox and any subsequent verification steps.",
              "decision": "require_confirmation"
            }
          }
        }
      }
    ]
  }
}
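If the `safety_decision` is `require_confirmation`, you must ask the end user to confirm before proceeding with the action.
The following code sample prompts the end user for confirmation before executing the action. If the user doesn't confirm the action, the loop terminates. If the user confirms the action, the action is executed and the `safety_acknowledgement` field is marked as true (the string `"true"` in this sample).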
import termcolor


def get_safety_confirmation(safety_decision):
    """Prompt user for confirmation when safety check is triggered."""
    termcolor.cprint("Safety service requires explicit confirmation!", color="red")
    print(safety_decision["explanation"])
    decision = ""
    while decision.lower() not in ("y", "n", "ye", "yes", "no"):
        decision = input("Do you wish to proceed? [Y]es/[N]o\n")
    if decision.lower() in ("n", "no"):
        return "TERMINATE"
    return "CONTINUE"


def execute_function_calls(response, page, screen_width: int, screen_height: int):
    # ... Extract function calls from response ...
    for function_call in function_calls:
        extra_fr_fields = {}
        # Check for safety decision
        if 'safety_decision' in function_call.args:
            decision = get_safety_confirmation(function_call.args['safety_decision'])
            if decision == "TERMINATE":
                print("Terminating agent loop")
                break
            extra_fr_fields["safety_acknowledgement"] = "true"  # Safety acknowledgement
        # ... Execute function call and append to results ...
If the user confirms, you must include the safety acknowledgement in your `FunctionResponse`.
function_response_parts.append(
    FunctionResponse(
        name=name,
        response={"url": current_url,
                  **extra_fr_fields},  # Include safety acknowledgement
        parts=[
            types.FunctionResponsePart(
                inline_data=types.FunctionResponseBlob(
                    mime_type="image/png", data=screenshot
                )
            )
        ]
    )
)
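The rule for the `response` field can be sketched as a small pure function: the acknowledgement is added only when a safety decision was present and the user confirmed it. `build_response_payload` is a hypothetical helper, and raising on an unconfirmed action is a design choice of this sketch that guards against reporting an action that should not have been executed.

```python
from typing import Any, Dict, Optional


def build_response_payload(
    current_url: str,
    safety_decision: Optional[Dict[str, Any]] = None,
    user_confirmed: bool = False,
) -> Dict[str, str]:
    """Build the `response` dict for the FunctionResponse of an executed action."""
    if safety_decision is not None and not user_confirmed:
        # An action that required confirmation must not be reported as executed.
        raise ValueError("action required confirmation but was not confirmed")
    payload = {"url": current_url}
    if safety_decision is not None:
        # Same key and string value as in the sample code above.
        payload["safety_acknowledgement"] = "true"
    return payload
```

The returned dictionary would be passed as `response=` when constructing the `FunctionResponse`.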
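Safety best practices
The Computer Use model and tool is a novel tool and presents new risks that developers should be aware of:
- Untrusted content and scams: as the model tries to achieve the user's goal, it may rely on untrusted sources of information and on instructions shown on the screen. For example, if the user's goal is to buy a Pixel phone and the model encounters a "Free Pixel if you complete a survey" scam, the model is likely to complete the survey.
- Occasional unintended actions: the model can misinterpret a user's goal or the content of a web page, causing it to take incorrect actions such as clicking the wrong button or filling in the wrong form. This can lead to failed tasks or data exfiltration.
- Policy violations: the API's capabilities could be directed, intentionally or unintentionally, toward activities that violate Google's policies (the Generative AI Prohibited Use Policy and the Gemini API Additional Terms of Service). This includes actions that could interfere with a system's integrity, compromise security, bypass security measures such as CAPTCHAs, or control medical devices.
To address these risks, you can implement the following safety measures and best practices:
- Human-in-the-loop (HITL):
  - Implement user confirmation: when the safety response indicates `require_confirmation`, you must implement user confirmation before execution.
  - Provide custom safety instructions: in addition to the built-in user confirmation checks, developers can optionally add a custom system instruction that enforces their own safety policies, either to block certain model actions or to require user confirmation before the model takes certain high-stakes, irreversible actions. The following example shows a custom safety system instruction that you can include when interacting with the model.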
## **RULE 1: Seek User Confirmation (USER_CONFIRMATION)**

This is your first and most important check. If the next required action falls into any of the following categories, you MUST stop immediately, and seek the user's explicit permission.

**Procedure for Seeking Confirmation:**

* **For Consequential Actions:** Perform all preparatory steps (e.g., navigating, filling out forms, typing a message). You will ask for confirmation **AFTER** all necessary information is entered on the screen, but **BEFORE** you perform the final, irreversible action (e.g., before clicking "Send", "Submit", "Confirm Purchase", "Share").
* **For Prohibited Actions:** If the action is strictly forbidden (e.g., accepting legal terms, solving a CAPTCHA), you must first inform the user about the required action and ask for their confirmation to proceed.

**USER_CONFIRMATION Categories:**

* **Consent and Agreements:** You are FORBIDDEN from accepting, selecting, or agreeing to any of the following on the user's behalf. You must ask the user to confirm before performing these actions.
    * Terms of Service
    * Privacy Policies
    * Cookie consent banners
    * End User License Agreements (EULAs)
    * Any other legally significant contracts or agreements.
* **Robot Detection:** You MUST NEVER attempt to solve or bypass the following. You must ask the user to confirm before performing these actions.
    * CAPTCHAs (of any kind)
    * Any other anti-robot or human-verification mechanisms, even if you are capable.
* **Financial Transactions:**
    * Completing any purchase.
    * Managing or moving money (e.g., transfers, payments).
    * Purchasing regulated goods or participating in gambling.
* **Sending Communications:**
    * Sending emails.
    * Sending messages on any platform (e.g., social media, chat apps).
    * Posting content on social media or forums.
* **Accessing or Modifying Sensitive Information:**
    * Health, financial, or government records (e.g., medical history, tax forms, passport status).
    * Revealing or modifying sensitive personal identifiers (e.g., SSN, bank account number, credit card number).
* **User Data Management:**
    * Accessing, downloading, or saving files from the web.
    * Sharing or sending files/data to any third party.
    * Transferring user data between systems.
* **Browser Data Usage:**
    * Accessing or managing Chrome browsing history, bookmarks, autofill data, or saved passwords.
* **Security and Identity:**
    * Logging into any user account.
    * Any action that involves misrepresentation or impersonation (e.g., creating a fan account, posting as someone else).
* **Insurmountable Obstacles:** If you are technically unable to interact with a user interface element or are stuck in a loop you cannot resolve, ask the user to take over.

---

## **RULE 2: Default Behavior (ACTUATE)**

If an action does **NOT** fall under the conditions for `USER_CONFIRMATION`, your default behavior is to **Actuate**.

**Actuation Means:** You MUST proactively perform all necessary steps to move the user's request forward. Continue to actuate until you either complete the non-consequential task or encounter a condition defined in Rule 1.

* **Example 1:** If asked to send money, you will navigate to the payment portal, enter the recipient's details, and enter the amount. You will then **STOP** as per Rule 1 and ask for confirmation before clicking the final "Send" button.
* **Example 2:** If asked to post a message, you will navigate to the site, open the post composition window, and write the full message. You will then **STOP** as per Rule 1 and ask for confirmation before clicking the final "Post" button.

After the user has confirmed, remember to get the user's latest screen before continuing to perform actions.

# Final Response Guidelines:

Write final response to the user in these cases:
- User confirmation
- When the task is complete or you have enough information to respond to the user
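- Secure execution environment: run your agent in a secure, sandboxed environment to limit its potential impact (for example, a sandboxed virtual machine (VM), a container such as Docker, or a dedicated browser profile with limited permissions).
- Input sanitization: sanitize all user-generated text in prompts to reduce the risk of unintended instructions or prompt injection. This is a helpful layer of security, but it is not a replacement for a secure execution environment.
- Allowlists and blocklists: implement filtering mechanisms that control which websites the model can visit and which actions it can take. A blocklist of prohibited websites is a good starting point, while a more restrictive allowlist is more secure.
- Observability and logging: maintain detailed logs for debugging, auditing, and incident response. Your client should log prompts, screenshots, the actions suggested by the model (`function_call`), safety responses, and all actions that the client ultimately executes.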
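The allowlist and logging practices above can be sketched with the standard library alone. Everything named here is illustrative: `ALLOWED_HOSTS` holds placeholder hosts, and `is_allowed_url`, `check_call`, and `audit_line` are hypothetical helpers that operate on a plain-dict form of a function call.

```python
import json
from typing import Any, Dict, FrozenSet, Tuple
from urllib.parse import urlsplit

# Placeholder allowlist; a real deployment would load its own list.
ALLOWED_HOSTS: FrozenSet[str] = frozenset({"www.google.com", "www.wikipedia.org"})


def is_allowed_url(url: str, allowed_hosts: FrozenSet[str] = ALLOWED_HOSTS) -> bool:
    """Accept only https URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowed_hosts


def check_call(
    function_call: Dict[str, Any],
    allowed_hosts: FrozenSet[str] = ALLOWED_HOSTS,
) -> Tuple[bool, str]:
    """Decide whether a proposed action may run; only `navigate` is filtered here."""
    if function_call.get("name") == "navigate":
        url = function_call.get("args", {}).get("url", "")
        if not is_allowed_url(url, allowed_hosts):
            return False, f"host not on allowlist: {url}"
    return True, "ok"


def audit_line(function_call: Dict[str, Any], allowed: bool, reason: str) -> str:
    """Serialize one decision as a JSON line for an audit log."""
    record = {
        "action": function_call.get("name"),
        "args": function_call.get("args", {}),
        "allowed": allowed,
        "reason": reason,
    }
    return json.dumps(record, sort_keys=True)
```

Comparing the parsed hostname exactly, rather than searching the URL string, rejects lookalike hosts such as `www.google.com.evil.example`. Because clicks can also leave the allowlist, a fuller implementation would likely re-check the page URL after every executed action as well.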
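Pricing
The Computer Use model and tool is priced at the same rates as Gemini 2.5 Pro and uses the same SKUs. To break out costs for the Computer Use model and tool, use custom metadata labels. For more information about using custom metadata labels to monitor costs, see Custom metadata labels.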