A Mentor-Mentee Conversation on AI Research and Life Direction

2026.05

This article is compiled from the audio transcript of an online meeting. The two participants were a senior programmer and a high-school AI researcher (Ruixiu Zhang). Because the original transcript lost speaker labels, the content below is reorganized from a third-person perspective (“they”), preserving the colloquial style and core ideas of the original conversation, while removing repetition, small talk, and unrelated fragments, and grouping things by topic.

Note: this article was produced via Whisper transcription followed by Claude Code (Opus 4.7) reorganization. Some details may contain minor factual inaccuracies; please verify any important information independently before relying on it.


1. Communication Efficiency and the “Project Walkthrough” as an Opener

The meeting opened with a very practical observation: repeated communication is inefficient.

The senior mentioned that he had just finished an interview where he opened his blog and walked through who he was in about ten minutes — that, in his view, is what efficient communication looks like. He suggested they record this meeting, transcribe it, and the next time a middle- or high-school student came along with the same questions (how to do projects, how to package a resume), they could simply send over the organized material instead of explaining everything from scratch.

He pointed this out specifically: a lot of what you say to professors and to many other people is the same content repeated over and over. That state of affairs is fundamentally inefficient.

For the meeting tool itself, they were on Tencent Meeting, but the senior recommended switching to Teams next time, mentioning he also has Zoom. The reasoning: these are products built with hundreds of millions, even billions of dollars — getting comfortable with an industrial-grade collaboration tool is itself part of “living in the future.”


2. Why “Low-Level Algorithms” and “Building Products” Are Two Different Paths

On direction, the senior first shared an observation about another student (Yangyang): that student works on optimization within the NVIDIA stack (VRAM, memory movement, kernel-level engineering). That work is essentially ops and engineering, though it sometimes yields critical discoveries too, such as FlashAttention.

The high-school student’s stance was clear:

The senior’s view: both paths are valid, but you have to know clearly which one you’re on.

He then said something fairly blunt, paraphrasing: growing up in the Chinese environment already means a lot of effort goes in, but there is always a sky beyond the sky. Compared with the top 16-year-olds in Silicon Valley, the gap mainly shows up in two places:

  1. Practical English expression: theirs is extremely proficient, as fluent as their Chinese;
  2. Computer skills are roughly comparable, but in a high-intensity communication environment like OpenAI, if your English comprehension isn't as efficient as your Chinese, you will be a bit slower explaining yourself in a very complicated transformer discussion; that alone can let another candidate edge past you, even if their thinking isn't sharper and their AI fundamentals aren't more solid. Communication is itself a hard metric.

So he kept emphasizing: English fluency isn’t about English — it’s about not suffering in the future.


3. Tree of Thoughts Project: A Full Code Walkthrough

The technical core of the meeting was the high-school student sharing his own Tree of Thoughts physics-problem-solving system. He ran it locally and walked through the architecture step by step.

3.1 Model Division of Labor: Four Roles

The system is a collaboration of four kinds of models:

Role                          | Size              | Responsibility
Planning Model (Orchestrator) | Large (120B)      | Decompose the problem into meta-task → task, planning step by step
Modeling Model                | Medium (9B)       | Strictly follow the planner's instructions; never shown the full problem
Review Model                  | Small (0.5B / 4B) | Inspect each step's logic; decide noise / send back / pass
Evaluation Model              | Small             | Subjective scoring, e.g. whether the final formula is clean enough

The high-school student explained a key design decision: why must the Modeling Model not see the full problem?

If you let it see the original problem plus the decomposed sub-task, it can’t help itself — it will try to “solve the whole thing in one shot” instead of following the decomposition. So you must feed it only the instruction for the current step.

The senior’s comment: “That’s exactly right — you have to give it something precise, strip out everything it doesn’t need, otherwise it will misunderstand you.”
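The isolation principle above can be made concrete with a small sketch. The names here (Step, build_modeling_prompt) are illustrative, not from the project; the point is only that the modeling model's prompt is assembled from the planner's per-step instruction, never from the original problem text:

```python
# Sketch of the "show only the current step" principle.
# Names and prompt wording are illustrative, not the project's.
from dataclasses import dataclass

@dataclass
class Step:
    instruction: str   # the planner's instruction for this step only
    context: str       # results carried over from earlier steps

def build_modeling_prompt(step: Step) -> str:
    """The modeling model never sees the original problem text,
    only the planner's instruction for the current step."""
    return (
        "You are a modeling assistant. Follow the instruction exactly; "
        "do not attempt to solve anything beyond it.\n"
        f"Known results so far:\n{step.context}\n"
        f"Current instruction:\n{step.instruction}\n"
    )

prompt = build_modeling_prompt(Step(
    instruction="Express the terminal velocity from the force balance.",
    context="Forces identified: gravity mg, drag kv.",
))
assert "Current instruction" in prompt
```

Because the full problem simply never enters this function, the model cannot be tempted to solve the whole thing in one shot.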

3.2 Why FSM (Finite State Machine) Instead of a Simple Data Structure

Review uses an FSM-based state-management approach. The high-school student’s reasoning:
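One attraction of an FSM over an ad-hoc data structure is that illegal transitions fail loudly instead of silently corrupting the tree. A minimal sketch, assuming three review verdicts like those described (noise / send back / pass); state names and the transition table are illustrative:

```python
# Minimal FSM-style review loop. States and verdicts are illustrative
# stand-ins for the system's actual labels.
REVIEW_TRANSITIONS = {
    ("pending", "pass"): "accepted",
    ("pending", "revise"): "pending",    # sent back for another attempt
    ("pending", "noise"): "discarded",
}

def review_step(state: str, verdict: str) -> str:
    try:
        return REVIEW_TRANSITIONS[(state, verdict)]
    except KeyError:
        # e.g. trying to revise an already-discarded node
        raise ValueError(f"illegal transition: {state} + {verdict}")

state = "pending"
state = review_step(state, "revise")   # still pending
state = review_step(state, "pass")     # now accepted
assert state == "accepted"
```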

3.3 On “Diverging” and “Converging”

He encourages the model to diverge as much as possible at the earliest nodes, e.g. generating 5–7 different approaches; later on, if a branch is just adding a correction term (such as drag), there’s no need to diverge further — the review model decides this on its own.

A humanities problem might have the opposite shape: in the stage of rebutting an existing argument, you may actually need to diverge more late in the process. So the shape of the “thought tree” is itself a reflection of the discipline’s character.
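The diverge-early, converge-late policy can be sketched as a branching-width function over tree depth. The specific numbers and the name branch_width are my own illustrative choices; in the actual system the review model decides this dynamically:

```python
# Sketch of a depth-dependent branching policy: diverge widely at the
# root (5-7 approaches, as described), narrow to refinement later.
def branch_width(depth: int, is_correction_term: bool) -> int:
    if is_correction_term:
        return 1                  # e.g. adding a drag term: no further divergence
    if depth == 0:
        return 6                  # root: generate many distinct approaches
    return max(1, 3 - depth)      # taper off as the tree deepens

assert branch_width(0, False) == 6   # wide at the root
assert branch_width(3, True) == 1    # corrections never branch
```

A humanities-shaped tree would invert this function, widening again at the rebuttal stage.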

3.4 Skill System and External SQL

He externalizes all “domain knowledge”:
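As a sketch of what externalized domain knowledge can look like: formulas live in a database rather than in the prompt or the code, and each sub-task fetches only what it needs. The schema and contents below are invented for illustration:

```python
# Illustrative sketch: physics "skills" stored in SQL, queried per sub-task.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE skills (name TEXT PRIMARY KEY, formula TEXT, domain TEXT)")
conn.executemany(
    "INSERT INTO skills VALUES (?, ?, ?)",
    [
        ("projectile_range", "R = v^2 * sin(2*theta) / g", "kinematics"),
        ("terminal_velocity", "v_t = m*g / k", "drag"),
    ],
)

def lookup_skills(domain: str) -> list:
    """Fetch only the formulas relevant to the current sub-task's domain."""
    rows = conn.execute(
        "SELECT name, formula FROM skills WHERE domain = ?", (domain,)
    )
    return rows.fetchall()

assert lookup_skills("drag") == [("terminal_velocity", "v_t = m*g / k")]
```

Keeping knowledge out of the prompt also keeps the modeling model's context minimal, consistent with the isolation principle above.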

3.5 Limitations of the Orchestrator and Why Start with a Small Model

The final answer the demo produced wasn’t especially accurate (e.g. one problem ended with something like v_eff × h, a form that wasn’t precise enough). He was upfront about it:

“Open-source small models are still too weak. If I called an API, the result would definitely be better.”

But he stressed a working method that the senior strongly endorsed:

First, get the entire harness running on a local small model, instead of jumping straight to an API.

Small models expose the small pitfalls you’d never see when calling an API; once the whole pipeline is running, switching to a bigger model can only make it better, not worse.
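One way to make the "small model first, API later" workflow painless is to code the harness against a single interface, so swapping backends cannot break the plumbing. The class and method names below are illustrative, not from the project:

```python
# Sketch: one interface, two interchangeable backends.
from typing import Protocol

class TextModel(Protocol):
    def generate(self, prompt: str) -> str: ...

class LocalSmallModel:
    """Stand-in for a local open-source model; exposes every pipeline
    pitfall cheaply before any API cost is incurred."""
    def generate(self, prompt: str) -> str:
        return f"[local draft] {prompt[:40]}"

class HostedLargeModel:
    """Same interface; once the harness runs end to end, dropping this in
    can only improve answer quality, not break plumbing."""
    def __init__(self, client):            # client would wrap a real API
        self.client = client
    def generate(self, prompt: str) -> str:
        return self.client.complete(prompt)

def run_pipeline(model: TextModel, problem: str) -> str:
    return model.generate(f"Plan the first step for: {problem}")

out = run_pipeline(LocalSmallModel(), "block sliding down an incline")
assert out.startswith("[local draft]")
```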


4. Code Organization and Codex / Copilot Habits

They went through several files together: backend / scheduler / models / planning_model / utils. Some observations:

Tooling suggestions from the senior:


5. Mini-LLM Project: A nanoGPT-Style Training Implementation

Next they looked at the second project — a mini-LLM training implementation modeled on nanoGPT.

5.1 Multiple Versions and Hardware Adaptation

The reason there are several versions in the repo: he ran the same code on different hardware:

Different hardware requires different precision, batch size, and parallelism strategies; the version he kept is the A100 one.
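The hardware-adaptation idea can be sketched as a per-device config table. All values below are illustrative defaults, not the repo's actual settings; the point is that gradient accumulation keeps the effective batch comparable while precision and per-step batch size track what the hardware supports:

```python
# Illustrative per-hardware training settings (not the repo's values).
TRAIN_CONFIGS = {
    "a100": dict(dtype="bfloat16", batch_size=64, grad_accum=1),   # ample VRAM, bf16 support
    "t4":   dict(dtype="float16",  batch_size=8,  grad_accum=8),   # older GPU: fp16 + accumulation
    "cpu":  dict(dtype="float32",  batch_size=2,  grad_accum=32),  # debugging only
}

def effective_batch(cfg: dict) -> int:
    # Accumulating gradients over several micro-batches keeps the
    # effective batch size the same across hardware tiers.
    return cfg["batch_size"] * cfg["grad_accum"]

assert all(effective_batch(c) == 64 for c in TRAIN_CONFIGS.values())
```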

5.2 A Few Details Quizzed on the Spot

The senior ran through Transformer details quiz-style:
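The K/Q/V mechanics that such quizzes center on fit in a few lines. A from-scratch sketch of single-head scaled dot-product attention, softmax(QKᵀ/√d)·V, in pure Python with tiny dimensions for clarity only:

```python
# Single-head scaled dot-product attention, from scratch.
import math

def softmax(xs):
    m = max(xs)                              # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d = len(Q[0])                            # head dimension
    out = []
    for q in Q:                              # one output row per query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
row = attention(Q, K, V)[0]
# The query aligns with the first key, so the first value dominates.
assert row[0] > row[1]
```

RoPE, the other quiz staple, would enter upstream of this function by rotating each (q, k) pair according to token position before the dot products.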

5.3 Data Scale and Hardware Upgrade


6. The Third Project: An Agentic Coding Experiment

This is a mini “multi-agent collaborative coding” system the high-school student built, inspired by Claude Code:

He rates this project as a one-off exploration that he didn’t continue maintaining, but he did get small tasks like “Snake game code” running through it.


7. On “Proving Your Level” — Important Advice from the Senior

When the conversation turned to RoPE and KQV, the senior asked a slightly pointed question:

“You’re saying you’re stronger than that Deng Mingyang or the Kimi guy, but I’m actually skeptical.”

He elaborated: it isn’t that he doesn’t believe in the potential; the point is that outsiders have no credible way to judge it. To get others to believe you’re stronger than some 0xy / IOI medalist, there are usually a few paths:

  1. Have repeated conversations with someone who already knows you reasonably well, so they can see your grasp of LM knowledge across multiple points in time;
  2. Let the resume speak for itself — at a glance, the reader knows this person is stronger than X;
  3. Get identified on the spot in an interview by a sufficiently senior interviewer (he gave the example of his own morning interview that day for a Standard Chartered overseas role — half an hour in, the interviewer signaled he had passed).

The point of saying this isn’t to deflate, it’s to remind: “your level” can’t stay as a private feeling. It needs a vehicle others can see.

He also touched on something interesting — a name itself can become a brand:

“When people hear your name and immediately know you’re a top-tier person, you’ve fully arrived — your name becomes your brand.”


8. On “Reversing Myopia” — A Side Research Topic of the Senior’s

Mid-meeting, the senior shared a non-mainstream topic he’d been working on for years — reversibility of myopia. The brief points:

He drew a parallel: this is essentially the same as how he understands GPT —

“In 2017 there weren’t many people discussing Transformers either; the things that genuinely work are, in their early stages, only known to a small number of people.”

And he reminded the student that, as someone doing heavy near-work daily, this deserves a few minutes of attention even more than the project work does.


9. On Network, Foreign-Language Environment, and Independent Thinking

They went down the high-school student’s WeChat group roster line by line:

When the conversation turned to Wang Yin, Daniel P. Friedman, and The Little Schemer, the senior spent a lot of time explaining what he means by “independent thinking”:

He directed this point at the high-school student: why so many things must be thought up by yourself rather than copied from others — that is the foundation of everything that comes later.


10. The Senior’s Own Founding Experience (As a Reference Point)

To put a footnote on “project credibility,” the senior described a WeChat mini-program live trivia project he ran around 2016:

He then told the early-Airbnb story — the founders meeting with an investor over coffee, the investor leaving for the bathroom and never coming back. He wasn’t complaining; he wanted the high-school student to know in advance:

“The world is very transactional. When you don’t have a name, the cold rejections will be many — you have to be mentally prepared for that.”


11. A Quick Math Quiz

A short quiz session was sprinkled in:


12. AI Tools and Development Habits Summary

Scattered but valuable points:


13. Collaboration Plan and Next Meeting

As the meeting wound down, the senior put forward a concrete collaboration idea:

“I’ve already trained GPT-2 myself. We should team up and train a GPT-3. If we can also train a GPT-4 (at the 2022 level), we’d be near the front of the AI field.”

The high-school student’s reply kept its engineering sobriety: scale up the parameters and you no longer need a single machine — you need a cluster. The senior thinks platforms like Runpod can handle the cluster side; the bottleneck remains the same three things — data, compute, algorithms.

For the next meeting, they agreed:

  1. Switch to Teams;
  2. Walk through Transformer / KQV again — that is the part most often asked in interviews;
  3. The high-school student to send a myopia exam report;
  4. The high-school student to finish reading the few Wang Yin articles the senior recommended.

Closing

This conversation covered a wide range: from a 16-year-old’s Tree-of-Thoughts physics-reasoning framework, to live questioning on RoPE and KQV, to a Silicon Valley founder anecdote, to myopia reversal and a methodology of “independent thinking.”

If you had to compress it into one line, it would be roughly this:

Capability itself is necessary; but beyond capability, you need a vehicle that lets others see it — a resume, a project, a name, English fluency, and the steady judgment others form of you across multiple meetings.

On the technical side, both agreed: first get the pipeline running on a small model, then switch to a bigger one; let the model perform within the boundaries you design, instead of letting it think on your behalf. Whether building Tree of Thoughts or a mini-LLM, fundamentally these are the same thing.

