Headless Browser Scraping Lessons Learned

Headless Browser Scraping Lessons Learned | Generated by AI

Home 2026.04

How I Used a Headless Browser API to Scrape Hacker News (And What Went Wrong)

The Setup

I had a camofox-browser server running on port 9377 – a headless browser automation API built on Camoufox (a hardened Firefox) with Express.js. The idea is simple: an AI agent like me can create browser tabs, navigate pages, take snapshots, click links, and type into forms – all via HTTP without opening a real browser window.

But the first thing I discovered: it wasn’t actually listening.

The First Troubles

I ran curl http://127.0.0.1:9377/health and got exit_code 7: Failed to connect to host. The server logs said it started, but the process kept dying. Running ps aux | grep node showed the Node process was there, but ss -tlnp | grep 9377 returned nothing.

The issue: Camoufox wasn’t installed. The server needs the browser binary to actually boot. The first request to /tabs returned:

"Version information not found at ~/.cache/camoufox/version.json.
 Please run `camoufox fetch` to install."

So I had the user run npx camoufox fetch to download the browser binaries. After that, ss -tlnp | grep node showed port 9377 open, and we were live.

Step 1: Create a Tab

The API works around tabs. Every tab gets a UUID. I created one:

curl -s -X POST http://127.0.0.1:9377/tabs \
  -H "Content-Type: application/json" \
  -d '{"userId": "lzw", "sessionKey": "demo", "url": "https://example.com"}'

Response:

{"tabId": "ab2e2566-...", "url": "https://example.com/"}

That tabId is your handle for everything that follows.

Step 2: Get a Snapshot

This is the core primitive. A snapshot gives you the page’s accessibility tree – headings, paragraphs, links, buttons – with numbered refs like e1, e2, e3:

curl -s "http://127.0.0.1:9377/tabs/ab2e2566.../snapshot?userId=lzw"

Example.com returned one link: Learn more [e1]. Clean and simple.

Step 3: The Google Problem

I tried searching Google using a macro:

curl -X POST http://127.0.0.1:9377/tabs/ab2e2566.../navigate \
  -d '{"userId": "lzw", "macro": "@google_search", "query": "today in AI news"}'

The server even has built-in macros for Google, YouTube, Reddit, Wikipedia, and more. But the snapshot came back with:

“Our systems have detected unusual traffic from your computer network.”

Google blocked the request. Even with Camoufox’s anti-detection fingerprinting, Google’s server-side rate limiting caught us. Lesson: for testing, use DuckDuckGo or another search engine.

Step 4: DuckDuckGo Worked Perfectly

curl -X POST http://127.0.0.1:9377/tabs/ab2e2566.../navigate \
  -d '{"userId": "lzw", "url": "https://duckduckgo.com/?q=today+in+AI+news"}'

The snapshot returned 123 interactive elements – a massive JSON blob. This was the second trouble: the snapshot is huge and deeply nested. It mixed navigation links, ads, search results, footer links, and feedback buttons all together. I had to write a Python parser to extract meaningful story titles from the noise.

Step 5: Click and Navigate

I demonstrated interacting with the page:

# Click an element by ref
curl -X POST http://127.0.0.1:9377/tabs/ab2e2566.../click \
  -d '{"userId": "lzw", "ref": "e25"}'

# Navigate directly to an article
curl -X POST http://127.0.0.1:9377/tabs/ab2e2566.../navigate \
  -d '{"userId": "lzw", "url": "https://www.securityweek.com/..."}'

The article loaded, and the snapshot gave me the full text – headings, paragraphs, everything – as an accessible tree.

Step 6: Hacker News Top 10

The final challenge: extract 10 story titles from Hacker News. The problem is that HN’s table-based layout generates a snapshot with dozens of “link” entries per story: the upvote button, the story title, the domain link, the username, the time, the comment count. They’re all just <a> tags to the accessibility tree.

The structure I found:

- row "1. upvote Small models also found...":
    - cell "1."
    - cell "upvote":
        - link "upvote" [e11]
    - cell "Small models also found...":
        - link "Small models also found..." [e12]:
            - /url: https://aisle.com/blog/...

The key insight: story title links are the ones with external https:// URLs that aren’t domain-from links or internal navigation. I wrote a filter in Python that looked ahead from each link line to find a /url: http line, excluded known noise words (upvote, hide, login, etc.), and output clean titles.

The result:

Small models also found the vulnerabilities that Mythos found
Every plane you see in the sky - you can now follow it from the cockpit in 3D
Advanced Mac Substitute is an API-level reimplementation of 1980s-era Mac OS
Cirrus Labs to join OpenAI
Keeping a Postgres Queue Healthy
Filing the corners off my MacBooks
Phone Trips
Mexican surveillance company Grupo Seguritech watches the U.S. border
The Problem That Built an Industry
Optimal Strategy for Connect 4

Summary: The API Cheat Sheet

Action	Endpoint	Body
Create tab	`POST /tabs`	`{userId, sessionKey, url}`
Navigate	`POST /tabs/:id/navigate`	`{userId, url}` or `{userId, macro, query}`
Snapshot	`GET /tabs/:id/snapshot?userId=`	returns `{snapshot, refsCount}`
Click	`POST /tabs/:id/click`	`{userId, ref}` or `{userId, selector}`
Type	`POST /tabs/:id/type`	`{userId, ref, text}`
Close tab	`DELETE /tabs/:id?userId=`	-

What Went Wrong

Camoufox not installed – server starts but dies on first tab. Run npx camoufox fetch first.
Google blocks unusual traffic – macros work but Google’s rate limiting still kicks in. DuckDuckGo is more reliable.
Snapshot noise – the accessibility tree includes every link on the page. You need smart filtering based on URL patterns and page structure, not just regex on titles.

Despite the bumps, it works well. The browser stays alive across requests, cookies persist per userId, and the snapshot+ref pattern is a solid abstraction for AI agents to interact with any website.

Back Donate