A Standardized Cost Benchmark for AI Models | Generated by AI


Got it — let me unpack what you wrote and translate the idea into English, then analyze the issue.


Restating the context

You are pointing out that a provider's headline per-token price doesn't map cleanly onto what a real workload actually costs, and that the industry has no standardized cost benchmark for AI models the way it has standardized accuracy benchmarks.

On the cache metrics you mentioned

The combined effect of better caching, fewer billable tokens, and shorter answers can create non-linear cost differences versus what headline pricing suggests.

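To make that combined effect concrete, here is a minimal Python sketch. All of the prices, token counts, and cache-hit ratios in it are invented for illustration; they are not DeepSeek's (or any provider's) actual rates.

```python
# Minimal sketch: how caching, prompt size, and answer length combine.
# All prices and token counts below are made-up illustrative numbers,
# not any provider's actual rates.

def request_cost(input_tokens, output_tokens, cache_hit_ratio,
                 price_in_miss, price_in_hit, price_out):
    """Cost of one request in dollars, with cached input tokens billed
    at a discounted rate (prices are per million tokens)."""
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    return (miss_tokens * price_in_miss
            + hit_tokens * price_in_hit
            + output_tokens * price_out) / 1_000_000

# Model A: weaker caching, more verbose prompts and answers.
a = request_cost(input_tokens=8_000, output_tokens=1_200,
                 cache_hit_ratio=0.2,
                 price_in_miss=0.50, price_in_hit=0.05, price_out=1.50)

# Model B: identical headline prices, but better caching, tighter
# prompts, and shorter answers.
b = request_cost(input_tokens=5_000, output_tokens=600,
                 cache_hit_ratio=0.8,
                 price_in_miss=0.50, price_in_hit=0.05, price_out=1.50)

print(f"A: ${a:.5f}  B: ${b:.5f}  ratio: {a / b:.1f}x")
# Same price card on paper, yet the realized per-request cost differs
# by roughly 3x once the three effects stack.
```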

Your observation with DeepSeek v3.2 pricing

You noticed that although DeepSeek v3.2 announced roughly a 50% price cut, the costs you actually observed did not drop by anywhere near 50%, at least not right after checking.

Why this happens:

  1. Headline pricing ≠ Effective cost.
    Providers lower “per million tokens” prices, but if your usage pattern is dominated by cache hits or shorter completions, the realized savings can look very different (a numeric sketch of points 1-4 follows this list).

  2. Input vs output asymmetry.
    Input tokens are usually cheaper than output tokens. If DeepSeek dropped input prices by 50% but left output prices nearly unchanged, and your workload produces long outputs, your costs won't fall by nearly as much.

  3. Billing granularity & rounding.
    Some APIs bill in increments of 1K or 100 tokens (depending on the backend implementation). For small tasks, that rounding can noticeably distort the comparison of total costs.

  4. Cache accounting differences.
    Some vendors treat cache hits as free or heavily discounted, others charge a fraction of the full rate. If DeepSeek changed only the cache-miss price while your workload is already mostly cached, your effective price reduction is smaller than advertised.

  5. Announcement vs rollout lag.
    Platforms like OpenRouter sometimes reflect a headline price drop immediately while the underlying API billing lags until every region is updated, so seeing no change right after checking could simply be a propagation issue.

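As a rough numeric sketch of points 1-4, the snippet below compares a workload's effective cost before and after a hypothetical "50% price cut" that only halves the cache-miss input rate. Every figure in it (prices, cache discount, billing granularity, traffic mix) is an assumption invented for illustration; nothing here reflects DeepSeek's actual price card.

```python
# Rough sketch: why a headline "-50% price cut" may not cut your bill by 50%.
# Every number here (prices, discounts, billing granularity, workload mix)
# is a made-up assumption for illustration only.

import math

def effective_cost(workload, prices, granularity=1_000):
    """Total cost in dollars for a list of (input_tokens, output_tokens,
    cache_hit_ratio) requests. Token counts are rounded up to the billing
    granularity before pricing (prices are per million tokens)."""
    total = 0.0
    for input_tokens, output_tokens, hit_ratio in workload:
        billed_in = math.ceil(input_tokens / granularity) * granularity
        billed_out = math.ceil(output_tokens / granularity) * granularity
        hit = billed_in * hit_ratio
        miss = billed_in - hit
        total += (miss * prices["input_miss"]
                  + hit * prices["input_hit"]
                  + billed_out * prices["output"]) / 1_000_000
    return total

# Hypothetical workload: mostly-cached prompts with long outputs.
workload = [(6_000, 2_000, 0.9)] * 1_000

old_prices = {"input_miss": 0.60, "input_hit": 0.06, "output": 1.70}
# A hypothetical "-50%" headline that in fact only halves the cache-miss input price:
new_prices = {"input_miss": 0.30, "input_hit": 0.06, "output": 1.70}

old = effective_cost(workload, old_prices)
new = effective_cost(workload, new_prices)
print(f"old ${old:.2f} -> new ${new:.2f} ({1 - new / old:.0%} realized savings)")
```

With this made-up traffic profile, where most input tokens are cache hits and outputs are long, the 50% headline cut shrinks to roughly a 4% reduction in realized spend.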

Big picture


So, to summarize in English:
You found that even though DeepSeek v3.2 announced a 50% price cut, your actual observed costs didn’t fall by 50%. This discrepancy happens because actual LLM task cost is shaped by caching, input/output distribution, and usage patterns, not just the headline price per token. Without standardized tools to normalize these factors, the industry can’t yet benchmark real cost-to-performance as cleanly as it does for accuracy (MMLU, SWE-bench, etc.).
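
One possible shape for such a standardized tool, sketched below with entirely hypothetical numbers: measure dollars per successfully completed task rather than dollars per million tokens, so that caching behavior, prompt and answer length, and usage patterns all land in the same denominator.

```python
# Sketch of one possible normalization: dollars per successfully completed
# task, rather than dollars per million tokens. The records below are
# hypothetical; a real benchmark would run the same task suite against
# each model and log measured costs and pass/fail outcomes.

from dataclasses import dataclass

@dataclass
class TaskRun:
    cost_usd: float   # measured API spend for this task
    passed: bool      # did the model solve the task?

def cost_per_solved_task(runs):
    solved = sum(r.passed for r in runs)
    spent = sum(r.cost_usd for r in runs)
    return float("inf") if solved == 0 else spent / solved

# Hypothetical results for two models on the same task suite.
model_a = [TaskRun(0.012, True), TaskRun(0.015, False), TaskRun(0.010, True)]
model_b = [TaskRun(0.004, True), TaskRun(0.005, True), TaskRun(0.006, False)]

print(cost_per_solved_task(model_a))  # higher spend, partly wasted on a failure
print(cost_per_solved_task(model_b))  # cheaper per solved task despite a failure
```

A real benchmark would pair a metric like this with a fixed task suite and measured (not list-price) costs, which is essentially the methodology question below.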


Would you like me to propose a prototype methodology for such a cost vs performance benchmark, so you could compare models like DeepSeek, GPT, Claude, etc., in real-world task scenarios instead of just theoretical price-per-token?

