LLM Costs, Agents, and Coding Tools
Table of Contents
- Optimizing LLM API Costs
  - Start with cost-effective models first
  - Avoid unnecessary high-end model usage
  - Prefer NLP libraries for simple tasks
  - Build specialized agents for efficiency
  - Compare models via extensive testing
- API Usage of DeepSeek and Mistral
  - DeepSeek costs scale with cache misses
  - Output tokens dominate Mistral expenses
  - Grok pricing heavily favors input tokens
  - Token usage varies by task complexity
  - Pricing aligns with documented rates
- General Agents vs Vertical Agents
  - General agents struggle with complexity
  - Vertical agents excel in specialization
  - Workflow tools limit flexibility
  - Custom Python agents offer control
  - Trade-offs between convenience and power
- A Picky Engineer’s Take on AI Coding Tools
  - Prefer practical utility over brand hype
  - VSCode + Copilot remains reliable
  - Claude Code impresses with diff-style edits
  - Grammar tools require manual verification
  - Experimentation trumps blind adoption
Optimizing LLM API Costs
2025.08
Source: openrouter.ai
While optimizing token usage, it’s advisable to start with more cost-effective models and upgrade to more advanced ones only when problems arise. Mistral, Gemini Flash, and DeepSeek are typically economical, whereas Claude Sonnet is generally more expensive. It’s also crucial to understand how Claude Code routes requests through Claude Code Router, as configured below.
In my recent experience, I incurred significant costs by neglecting this principle. I was trying to push toward my maximum usage just to find out what it would cost, which isn’t a rational approach; the cost is a simple calculation from the published per-token prices. For instance, do I truly need Sonnet 4? Not necessarily. Although I perceive it as a more advanced model from Anthropic and it ranks highly on OpenRouter, I can’t clearly articulate the differences between Sonnet 4 and Sonnet 3.5.
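As a rough illustration of that calculation, here is a minimal Python sketch. The prices are the Mistral rates quoted later in this post and serve only as placeholders; the monthly token volume is a made-up example.

```python
# "It's a simple calculation": estimate a month of usage on a cheap model
# vs. a pricier one before burning budget to find out the hard way.
# Prices (USD per million tokens) come from the Mistral table later in
# this post and are illustrative only.
PRICES = {
    "mistral-small-latest": {"input": 0.2, "output": 0.6},
    "mistral-large-2411": {"input": 2.0, "output": 6.0},
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly cost in USD for the given token volume."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

if __name__ == "__main__":
    # Hypothetical workload: 7M input and 8M output tokens per month.
    for name in PRICES:
        print(name, round(monthly_cost(name, 7_000_000, 8_000_000), 2))
```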
I learned something valuable from a recent interview with Replit founder Amjad Masad: for many tasks, highly advanced models aren’t necessary. Ideally, if we can avoid calling an LLM API altogether, that’s perfect. Certain NLP libraries are effective for simpler tasks; for example, HanLP excels at handling Chinese language tasks.
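As a minimal sketch of that idea (assuming HanLP 2.x; the pretrained model identifier below is illustrative and may differ across versions), a local tokenizer can replace an API call for simple Chinese segmentation:

```python
# Prefer a local NLP library for simple tasks before reaching for an LLM API.
# Assumes HanLP 2.x; the pretrained model name is illustrative.
import hanlp

# Load a Chinese word-segmentation model once at startup.
tokenizer = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)

def segment(text: str):
    """Tokenize Chinese text locally, with no per-call API cost."""
    return tokenizer(text)

if __name__ == "__main__":
    print(segment("商品和服务"))
```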
Furthermore, we can develop custom or specialized agents to handle tasks efficiently from the outset. Claude Code might not always be the best or most cost-effective solution for every task.
One way to discern the differences between models is to use them extensively and compare their performance. After some time using Gemini 2.5 Flash, I find it to be less capable than Sonnet 4.
```json
{
  "Router": {
    "default": "openrouter,google/gemini-2.5-flash",
    "background": "openrouter,google/gemini-2.5-flash",
    "think": "openrouter,google/gemini-2.5-flash",
    "longContext": "openrouter,google/gemini-2.5-flash",
    "longContextThreshold": 60000,
    "webSearch": "openrouter,google/gemini-2.5-flash"
  }
}
```
General Agents vs Vertical Agents
2025.08
Manus is claimed to be a general AI agent tool, but it probably won’t work that well.
One reason is that it is very slow, doing a lot of unnecessary work inefficiently. Another is that when it runs into a complex problem or one of its weak spots, your task is likely to fail.
Vertical agents work great because they are highly specialized and tailored to very specific tasks. There are dozens of databases, over a hundred backend frameworks like Spring, and numerous frontend frameworks such as Vue or React.
Dify focuses on using AI to connect workflows, employing a drag-and-connect interface to define them. It has to do a lot of work to connect information, data, and platforms.
I have built some simple agents too, such as a Python code refactoring agent, a grammar fixing agent, a bug fixing agent, and an essay merging agent.
Code is far more flexible, so Dify covers only a small portion of the space of possible ideas.
Manus performs tasks and shows users how it works by streaming a remote computer screen over VNC.
I think the future will settle on these two approaches.
For Manus, you need to upload code or text to perform tasks, which is not convenient. With Dify, you need to build workflows using drag and drop, similar to MIT Scratch.
Why isn’t Scratch as popular as Python? Because with Python, you can do so many things, while Scratch is limited to simple programs for educational purposes.
Dify probably has similar limitations.
Manus can handle a lot of simple tasks. However, for some tasks, especially those that hit Manus’s weaknesses, it will fail.
Also, many programs or services take time to set up. In Manus’s approach, this process is slow.
As a programmer, I use AI with Python to build my vertical agents. This is the simplest approach for me. I can also set up prompts and contexts to ensure relatively stable output from LLM APIs.
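As a minimal sketch of this approach, here is a grammar-fixing agent in Python. It assumes OpenRouter’s OpenAI-compatible chat completions endpoint; the model name and system prompt are illustrative examples, not recommendations.

```python
# A minimal grammar-fixing "vertical agent": one fixed system prompt plus
# one API call. Assumes OpenRouter's OpenAI-compatible endpoint; the model
# name and prompt are illustrative.
import os
import requests

SYSTEM_PROMPT = (
    "You are a careful copy editor. Fix grammar and spelling only; "
    "do not change the meaning or style. Return only the corrected text."
)

def fix_grammar(text: str, model: str = "mistralai/mistral-small") -> str:
    """Send text to the LLM with a fixed prompt and return the corrected version."""
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(fix_grammar("He go to school yesterday."))
```

Keeping the prompt and context fixed like this is what makes the output relatively stable from run to run.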
Manus and Dify are also built with these LLM APIs. Their advantage is that they already have a lot of tools or code ready to use.
If I want to build a Twitter bot agent, using Dify may be more convenient than building one myself with open-source technologies.
A Picky Engineer’s Take on AI Coding Tools
2025.08
Recently, I successfully ran Claude Code, so I want to share my tool selection journey. I have also collected some AI Tool Tips along the way.
I was quite late to adopt Claude Code.
Claude Code was released around the end of Feb 2025.
I didn’t succeed in trying it until recently. One reason is that it requires the Anthropic API, which doesn’t support Chinese Visa cards.
What changed recently is that Claude Code Router became available, which made my latest attempt successful.
I kept hearing praise for Claude Code. I tried the Gemini CLI in July 2025 but abandoned it after several failed attempts to get it to fix my code.
I also tried Aider, another software agent. I stopped using Cursor after about six months because many of its VSCode-based plugins malfunctioned. Additionally, I don’t want to give Cursor much credit since it is built on top of VSCode. As the Copilot plugin in VSCode has recently improved and doesn’t lag far behind, I prefer to use it more often.
However, VSCode is built on Electron, an open-source technology. It’s challenging to attribute credit to the right team or individual. Considering that many large companies and startups profit from open-source projects, I must focus on my budget and what suits me best. I shouldn’t worry too much about giving credit. I prefer using affordable and effective tools.
I briefly experimented with Cline but didn’t adopt it.
I use the Copilot plugin in VSCode with a customized model, Grok 3 beta through OpenRouter, which works well.
I don’t think Claude Code will change my habits, but since I can successfully run it and have the patience to try it a few more times, I’ll see how I feel in the coming weeks.
I am a picky user with 10 years of software engineering experience. I hope tools can be great in actual use. I don’t buy into the brand—I just care about daily usefulness.
After using Claude Code to fix this post’s grammar, I’ve found it works well in certain scenarios. While I appreciate AI for grammar assistance (I even wrote a Python script to call LLM APIs for this purpose), I’ve noticed a frustrating pattern: even when I request minimal fixes, the tools keep surfacing numerous grammar suggestions for review. This manual verification process defeats the purpose of automation. As a compromise, I now let AI handle entire essays, though this approach limits my learning opportunities since I don’t see the specific corrections being made.
What impressed me most was how Claude Code displays changes - showing before-and-after comparisons similar to git diffs, which makes reviewing edits much easier.
After a day, I used Claude to fix some code as well. However, I continue to use the Copilot plugin with the Grok 3 beta model, as it is simple and easy for me.
After using Claude Code for several days, I have to say it’s very impressive. I really like how it fixes my code.
Source: Self-screenshot
API Usage of DeepSeek and Mistral
2025.01.25
DeepSeek
In one month, 15 million tokens cost me approximately 23.5 CNY.
This was my usage in one day:
| Type | Tokens |
|---|---|
| Input (Cache Hit) | 946,816 |
| Input (Cache Miss) | 2,753,752 |
| Output | 3,100,977 |
The calculation, with token counts in millions and prices in CNY per million tokens, is as follows:
0.94 * 0.1 + 2.75 * 1 + 3.10 * 2 ≈ 9.05 CNY
So, depending on the task, the cost is driven largely by cache-miss input and output tokens.
This result aligns with the expected cost.
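The same arithmetic as a small Python helper, using the promotional prices from the calculation above (CNY per million tokens); current rates may differ.

```python
# Reproduce the DeepSeek cost estimate above.
# Prices are CNY per million tokens, taken from the calculation in this
# post (promotional deepseek-chat rates); check the current price list.
PRICE_CNY_PER_M = {"cache_hit": 0.1, "cache_miss": 1.0, "output": 2.0}

def deepseek_cost(cache_hit: int, cache_miss: int, output: int) -> float:
    """Estimated cost in CNY for the given token counts."""
    return (
        cache_hit * PRICE_CNY_PER_M["cache_hit"]
        + cache_miss * PRICE_CNY_PER_M["cache_miss"]
        + output * PRICE_CNY_PER_M["output"]
    ) / 1_000_000

if __name__ == "__main__":
    print(round(deepseek_cost(946_816, 2_753_752, 3_100_977), 2))  # ≈ 9.05
```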
Source: Self-Screenshot
Mistral
The pricing for Mistral models is as follows:
| Model | Input (USD per million tokens) | Output (USD per million tokens) |
|---|---|---|
| `mistral-large-2411` | 2 | 6 |
| `mistral-small-latest` | 0.2 | 0.6 |
In one day, my Mistral account usage was as follows (Model: `mistral-large-2411`):
| Type | Tokens | Cost (USD) |
|---|---|---|
| Total | 772,284 | 3.44 |
| Output | 474,855 | 2.85 |
| Input | 297,429 | 0.59 |
For the `mistral-small-2409` model, the total usage was 1,022,407 tokens.
Assuming 1/3 of these were input tokens and 2/3 were output tokens:
There were 340,802 input tokens and 681,605 output tokens.
Therefore, the total cost is calculated as 340,802 * 0.2 / 1,000,000 + 681,605 * 0.6 / 1,000,000 = 0.07 + 0.41 = 0.48 USD.
The Mistral console reports a total usage cost of 0.43 USD, which approximately matches our calculation.
Grok
| Model | Input (USD per million tokens) | Output (USD per million tokens) |
|---|---|---|
| `grok-2-latest` | 2 | 10 |