A Red Teamer’s OSINT Company Recon Guide | Generated by AI

2026.06

Question: What counts as “public information” about a company, and how can you systematically use it — plus hacker/OSINT techniques — to deeply research a company’s products, technology stack, and infrastructure before joining, augmented by an AI agent?

Answer:

This is OSINT (Open Source Intelligence) applied to corporate due diligence. The mental model: you’re a red-teamer doing recon, not a job applicant skimming Glassdoor.


The Public Information Surface

1. People Layer

2. Product / Frontend Layer

3. Infrastructure Layer

4. Code / Artifact Layer

5. Cloud-Specific Caution ⚠️

This is where it gets legally sensitive. Passive recon = fine. Active exploitation = illegal.

Technique | Legal? | Notes
Certificate transparency | Yes | Fully passive
Shodan lookup | Yes | Passive; Shodan did the scanning, not you
DNS enumeration | Yes | Passive
Subdomain brute-force | ⚠️ | Gray area: sends packets to the target
Accessing exposed S3 buckets | ⚠️/❌ | Even “public” buckets are legally risky
Fuzzing their APIs | No | CFAA territory
Exploiting any vulnerability | No | Criminal
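
One way to keep an automated pipeline on the passive side of this table is to encode it as an allowlist that is checked before any technique runs. The sketch below is only illustrative: `Legality`, `TECHNIQUES`, and `is_allowed` are hypothetical names, and the classifications simply mirror the table's rows rather than any authoritative legal reading.

```python
from enum import Enum


class Legality(Enum):
    OK = "ok"            # passive: no packets sent to the target
    GRAY = "gray"        # touches the target or is legally ambiguous
    ILLEGAL = "illegal"  # never automate


# Hypothetical mapping that mirrors the table above.
# The table marks exposed S3 buckets as gray-to-illegal; it is treated as
# GRAY here, which is still blocked unless the caller opts in.
TECHNIQUES = {
    "certificate_transparency": Legality.OK,
    "shodan_lookup": Legality.OK,
    "dns_enumeration": Legality.OK,
    "subdomain_bruteforce": Legality.GRAY,
    "access_exposed_s3": Legality.GRAY,
    "api_fuzzing": Legality.ILLEGAL,
    "exploit_vulnerability": Legality.ILLEGAL,
}


def is_allowed(technique: str, allow_gray: bool = False) -> bool:
    """Return True only for techniques the policy permits.

    Unknown techniques are rejected by default.
    """
    level = TECHNIQUES.get(technique)
    if level is Legality.OK:
        return True
    if level is Legality.GRAY:
        return allow_gray
    return False
```

An agent loop could call `is_allowed(...)` before dispatching a tool and refuse anything that is not explicitly listed as passive.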

AI Agent Architecture for This

Here’s a practical agent that automates the recon pipeline:

import anthropic
import subprocess
import json
import httpx
from typing import Any

client = anthropic.Anthropic()

tools = [
    {
        "name": "run_subfinder",
        "description": "Enumerate subdomains via certificate transparency and passive DNS",
        "input_schema": {
            "type": "object",
            "properties": {"domain": {"type": "string"}},
            "required": ["domain"]
        }
    },
    {
        "name": "fetch_crtsh",
        "description": "Query certificate transparency logs for subdomains",
        "input_schema": {
            "type": "object",
            "properties": {"domain": {"type": "string"}},
            "required": ["domain"]
        }
    },
    {
        "name": "fetch_job_postings",
        "description": "Fetch and analyze job postings to infer tech stack",
        "input_schema": {
            "type": "object",
            "properties": {"company_name": {"type": "string"}},
            "required": ["company_name"]
        }
    },
    {
        "name": "analyze_headers",
        "description": "Fetch HTTP headers from a URL to detect CDN, cloud provider, frameworks",
        "input_schema": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"]
        }
    },
    {
        "name": "github_search",
        "description": "Search GitHub for company repos and leaked info",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        }
    }
]

def run_tool(name: str, inputs: dict) -> Any:
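    if name == "fetch_crtsh":
        domain = inputs["domain"]
        # Pass the query via params so the "%" wildcard is URL-encoded correctly.
        r = httpx.get(
            "https://crt.sh/",
            params={"q": f"%.{domain}", "output": "json"},
            timeout=15
        )
        r.raise_for_status()
        # name_value may hold several newline-separated hostnames per entry.
        names = sorted({
            name_value.strip().lower()
            for entry in r.json()
            for name_value in entry["name_value"].split("\n")
            if "*" not in name_value
        })
        return {"subdomains": names[:50]}  # cap it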

    elif name == "analyze_headers":
        url = inputs["url"]
        r = httpx.head(url, follow_redirects=True, timeout=10)
        return {
            "headers": dict(r.headers),
            "status": r.status_code,
            "final_url": str(r.url)
        }

    elif name == "github_search":
        # Use GitHub search API
        query = inputs["query"]
        r = httpx.get(
            "https://api.github.com/search/repositories",
            params={"q": query, "sort": "updated", "per_page": 10},
            headers={"Accept": "application/vnd.github.v3+json"},
            timeout=10
        )
        items = r.json().get("items", [])
        return [{"name": i["full_name"], "desc": i["description"],
                 "stars": i["stargazers_count"], "lang": i["language"]}
                for i in items]

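    # run_subfinder and fetch_job_postings are declared in `tools` but not
    # implemented here; the agent receives this error and must rely on the others.
    return {"error": f"Tool {name} not implemented"}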

def recon_agent(target_company: str, target_domain: str) -> str:
    messages = [
        {
            "role": "user",
            "content": f"""Do a passive OSINT recon on company '{target_company}' with domain '{target_domain}'.

Your goal: build a technical intelligence report covering:
1. Infrastructure (cloud provider, CDN, IP ranges)
2. Tech stack (frameworks, databases, monitoring)
3. Engineering team signals (from GitHub)
4. Subdomains that reveal internal product architecture

Use tools sequentially. Start with cert transparency, then headers, then GitHub.
Only passive recon — no active scanning or exploitation."""
        }
    ]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        # Any stop reason other than "tool_use" (end_turn, max_tokens, ...)
        # means there is nothing left to execute, so return the text so far.
        if response.stop_reason != "tool_use":
            return "\n".join(
                b.text for b in response.content
                if b.type == "text"
            )

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"[agent] calling {block.name}({block.input})")
                try:
                    result = run_tool(block.name, block.input)
                except Exception as exc:
                    # Report failures to the model instead of crashing the loop.
                    result = {"error": str(exc)}
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

if __name__ == "__main__":
    report = recon_agent("Anthropic", "anthropic.com")
    print(report)
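
The `analyze_headers` tool above returns raw headers and leaves interpretation to the model. As a rough illustration of what that interpretation involves, the sketch below maps a few well-known response headers to likely providers or frameworks. `HEADER_HINTS` and `fingerprint_headers` are hypothetical names, the rule list is deliberately small and non-exhaustive, and headers can be stripped or spoofed, so matches are hints rather than proof.

```python
# Hypothetical (header name, substring, label) rules. An empty substring
# means the mere presence of the header counts as a match.
HEADER_HINTS = [
    ("server", "cloudflare", "Cloudflare"),
    ("cf-ray", "", "Cloudflare"),
    ("x-amz-cf-id", "", "Amazon CloudFront"),
    ("server", "amazons3", "Amazon S3"),
    ("x-vercel-id", "", "Vercel"),
    ("x-served-by", "cache-", "Fastly"),
    ("x-powered-by", "express", "Express (Node.js)"),
    ("x-powered-by", "php", "PHP"),
]


def fingerprint_headers(headers: dict[str, str]) -> list[str]:
    """Return provider/framework labels suggested by the response headers."""
    # Header names are case-insensitive, so normalise before matching.
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    found: list[str] = []
    for name, needle, label in HEADER_HINTS:
        value = lowered.get(name)
        if value is not None and needle in value and label not in found:
            found.append(label)
    return found
```

Feeding it the `headers` dictionary returned by `analyze_headers` would give the agent a short list of candidate providers to confirm against other evidence.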

The Synthesis Layer — What AI Adds

Raw OSINT data is noise. The agent’s real value is synthesis:

Subdomains found:
  api-internal.target.com → suggests separate internal API gateway
  grafana.target.com → Prometheus/Grafana observability stack
  kafka-ui.target.com → Kafka in use, exposed management UI
  staging-ml.target.com → separate ML serving infrastructure

Job postings mention:
  "PyTorch, Ray, vLLM" → ML inference stack
  "Temporal for workflow orchestration"
  "ClickHouse for analytics"

GitHub:
  3 public repos using their internal SDK → naming conventions revealed

Conclusion: AWS-based, Kafka+Temporal event-driven architecture,
Ray/vLLM for ML serving, Grafana observability, likely Kubernetes.
Engineering team ~40 people based on LinkedIn + commit velocity.

No insider info needed. A motivated outsider with this toolkit can end up knowing more about the product stack than many employees in non-technical roles.
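
Part of the synthesis shown above can be approximated with plain rules before any model is involved. The sketch below is a minimal, assumption-heavy illustration: `SUBDOMAIN_HINTS`, `STACK_KEYWORDS`, `hints_for_subdomain`, and `stack_from_postings` are hypothetical names, and the keyword lists are taken from the example rather than from any real dataset. Keyword matches are weak evidence and should be cross-checked.

```python
import re

# Hypothetical token -> inference rules, modeled on the example above.
SUBDOMAIN_HINTS = {
    "grafana": "Grafana observability",
    "kafka": "Kafka in use",
    "staging": "separate staging environment",
    "ml": "ML serving infrastructure",
    "internal": "internal-facing service",
}

# Hypothetical technology names to look for in job posting text.
STACK_KEYWORDS = ["PyTorch", "Ray", "vLLM", "Temporal", "ClickHouse", "Kafka"]


def hints_for_subdomain(hostname: str) -> list[str]:
    """Map tokens in the leftmost DNS label to rough inferences."""
    label = hostname.split(".")[0].lower()
    tokens = re.split(r"[-_]", label)
    return [SUBDOMAIN_HINTS[t] for t in tokens if t in SUBDOMAIN_HINTS]


def stack_from_postings(postings: list[str]) -> dict[str, int]:
    """Count how many postings mention each keyword as a whole word."""
    counts: dict[str, int] = {}
    for keyword in STACK_KEYWORDS:
        pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
        n = sum(1 for text in postings if pattern.search(text))
        if n:
            counts[keyword] = n
    return counts
```

For instance, `hints_for_subdomain("staging-ml.target.com")` yields both the staging and the ML hint, and the posting counts give a crude measure of how central each technology is.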


What You Can’t Get (Without Going Inside)

But product quality, architecture, scale, observability maturity, and security hygiene can all be inferred, at least in part, from outside.

