Browser Tools in Hermes Agent | Generated by AI
The browser tools I’m using are built into Hermes Agent (the CLI AI agent you’re talking to). Here’s how it works:
Stack:
- Browserbase — the underlying headless browser infrastructure (cloud-hosted Chromium instances). It’s stealth-capable (though running without residential proxies in this case).
- The tools are exposed as functions I can call in sequence:
| Tool | What it does |
|---|---|
| browser_navigate(url) | Opens a page, initializes the session |
| browser_snapshot() | Reads the DOM/accessibility tree and returns element refs like @e1, @e2 |
| browser_click(ref) | Clicks an element by ref |
| browser_type(ref, text) | Types into an input field by ref |
| browser_press(key) | Presses a keyboard key (Enter, Tab, Escape) |
| browser_scroll(direction) | Scrolls up or down |
| browser_console(expression) | Executes JS in the page context and reads the result |
| browser_vision(question) | Screenshot + visual analysis (native vision or fallback model) |
| browser_back() | Navigates back |
Workflow for this login attempt:
- browser_navigate → loaded http://192.168.22.1, got the form structure with ref IDs
- browser_type(@e1, "admin") → filled username
- browser_type(@e2, "admin") → filled password
- browser_click(@e3) → clicked the “登录” (Log in) button
- Checked the result with browser_snapshot() and browser_console() to read JS state
- After 3 failed combos, browser_vision() → took a screenshot to read the Chinese error message
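The sequence above can be sketched as a small loop. This is only an illustration: the real tools are invoked as agent tool calls, not as a Python API, so `FakeBrowser` below is a hypothetical stand-in that records calls and simulates a login form. The helper `try_logins`, the made-up valid credentials, and the choice to reload the page before each attempt are all assumptions, not Hermes Agent behavior.

```python
class FakeBrowser:
    """Hypothetical stand-in that records tool calls; NOT the real interface."""

    def __init__(self, valid=("admin", "secret")):  # made-up credentials
        self.calls = []
        self.valid = valid
        self.fields = {}
        self.logged_in = False

    def browser_navigate(self, url):
        self.calls.append(("browser_navigate", url))
        self.fields.clear()  # a fresh page load resets the form

    def browser_type(self, ref, text):
        self.calls.append(("browser_type", ref, text))
        self.fields[ref] = text

    def browser_click(self, ref):
        self.calls.append(("browser_click", ref))
        entered = (self.fields.get("@e1"), self.fields.get("@e2"))
        self.logged_in = entered == self.valid

    def browser_snapshot(self):
        self.calls.append(("browser_snapshot",))
        return "dashboard" if self.logged_in else "login form"

    def browser_vision(self, question):
        self.calls.append(("browser_vision", question))
        return "error text unavailable in this stub"


def try_logins(browser, url, combos, max_failures=3):
    """Try each (user, password) pair; ask for a screenshot after max_failures."""
    failures = 0
    for user, password in combos:
        browser.browser_navigate(url)          # assumption: reload per attempt
        browser.browser_type("@e1", user)      # username field ref
        browser.browser_type("@e2", password)  # password field ref
        browser.browser_click("@e3")           # login button ref
        if browser.browser_snapshot() != "login form":
            return (user, password)
        failures += 1
        if failures == max_failures:
            # Fall back to visual analysis to read the on-screen error
            browser.browser_vision("What does the error message say?")
    return None
```

Calling `try_logins(FakeBrowser(), "http://192.168.22.1", [("admin", "admin"), ("admin", "1234"), ("root", "root")])` returns `None` and leaves a `browser_vision` call as the last recorded step, mirroring the three failed combos described above.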
Key detail: The dosubmit() JS function Base64-encodes the password before POSTing, so I could also call it directly via browser_console to manipulate the form programmatically rather than clicking.
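To make the encoding step concrete, here is a minimal sketch of what Base64-encoding a password before a form POST looks like. The source only says that dosubmit() Base64-encodes the password; the form field names `username` and `password` and the URL-encoded body format below are assumptions for illustration, not details read from the router's page.

```python
import base64
from urllib.parse import urlencode


def encode_password(password: str) -> str:
    """Base64-encode a password the way a browser's btoa() would for ASCII input."""
    return base64.b64encode(password.encode("utf-8")).decode("ascii")


def build_login_body(username: str, password: str) -> str:
    """Build a URL-encoded POST body; the field names are hypothetical."""
    return urlencode({"username": username, "password": encode_password(password)})


print(encode_password("admin"))            # YWRtaW4=
print(build_login_body("admin", "admin"))  # username=admin&password=YWRtaW4%3D
```

Base64 is an encoding, not encryption, so the value is trivially reversible by anyone who sees the request.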
It’s essentially a remote-controlled Chromium — think Playwright/Puppeteer but cloud-hosted and accessible as tool calls from the CLI agent.