Together AI: Tri Dao and Platform

Together AI: Tri Dao and Platform | Generated by AI

Home 2026.06

Question: What is Together AI specifically about, and does Tri Dao work there?

Answer:

Yes — Tri Dao is a Co-Founder & Chief Scientist at Together AI, not just an employee. He’s one of the five co-founders alongside Vipul Ved Prakash (CEO), Ce Zhang (CTO), Chris Ré, and Percy Liang — all heavy hitters from Stanford/CMU systems + ML research.

What Together AI is:

Together AI is a full-stack AI platform for production AI, powered by cutting-edge systems research — helping teams ship faster, scale reliably, and achieve superior unit economics. More concretely, it’s three things:

Inference API — fast, cheap serving of open-source models (Llama, Qwen, Mistral, DeepSeek, etc.). Real-world benchmarks show 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.
Fine-tuning platform — managed fine-tuning on their GPU infra, recently upgraded with larger models and longer contexts.
Systems research — this is the differentiator. They published ATLAS (AdapTive-LeArning Speculator System), runtime-learning accelerators delivering up to 4x faster LLM inference, and do serious kernel-level work (Dan Fu is VP of Kernels).

Tri Dao’s specific role:

He’s also an Assistant Professor of Computer Science at Princeton, leading the Dao AI Lab, focused on hardware-aware algorithms and sequence models with long-range memory. His two defining works you should know:

FlashAttention / FlashAttention-2 — IO-aware exact attention, now standard in every serious training stack. FA-2 achieves 72% model FLOPs utilization on A100s and speeds up training by up to 4x vs the previous version.
Mamba (with Albert Gu) — SSM-based alternative to Transformers, relevant to the “beyond Transformers” wave. Together has been publishing on StripedHyena and hybrid architectures building on this.

Why it matters to you:

Together AI is one of the best alternatives to OpenAI’s API if you’re building on open-source models — competitive pricing, FlashAttention-optimized serving, and real systems research behind it (not just reselling GPU time). Given you’re already on OpenRouter + DeepSeek, Together is worth benchmarking for latency-sensitive workloads, especially for Llama/Qwen inference.

References:

Back Donate