How Cursor Trained Composer on Fireworks

"The most kind of leveraged attribute of your application is the actual usage of user data or particular specific aspects of how this application works."

Recap

  • 0:00-0:36: Fake RL environments can teach models tricks that fail in production.
  • 0:54-2:57: Cursor trained Composer 2 for one specialized task: software engineering inside Cursor.
  • 2:58-4:56: Application companies can evolve from off-the-shelf models to product-specific data and harness training.
  • 6:16-12:10: Composer 2 starts from Kimi K2.5, then uses code-heavy training and reinforcement learning.
  • 12:11-21:46: Distributed training and rollout infrastructure keep GPUs busy and move model updates efficiently.
  • 22:02-27:05: MoE routing mismatch can corrupt training, so the system needs router replay and matched kernels.
  • 27:06-33:36: Long-horizon agents need credit assignment and self-summarization.
  • 37:35-44:33: The strongest environment is the actual product, carefully isolated and scored.
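
The routing-mismatch item (22:02-27:05) can be made concrete with a toy sketch. It assumes plain top-k gating over router logits and reads "router replay" as recording the experts chosen during the rollout and reusing them in the training pass. The function names `top_k_experts` and `route` and all logit values are hypothetical; this is an illustration of the idea, not Cursor's or Fireworks' implementation.

```python
from typing import List, Optional


def top_k_experts(logits: List[float], k: int) -> List[int]:
    """Return indices of the k largest router logits (ties go to the lower index)."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return sorted(order[:k])


def route(logits: List[float], k: int, replay: Optional[List[int]] = None) -> List[int]:
    """Pick experts for one token; if a recorded choice is supplied, reuse it."""
    if replay is not None:
        return list(replay)
    return top_k_experts(logits, k)


# Hypothetical router logits for one token, as computed by the inference engine.
inference_logits = [0.10, 2.00, 1.5001, 1.5000]
# The same token in the trainer, with tiny numerical drift from different kernels.
training_logits = [0.10, 2.00, 1.4999, 1.5002]

sampled = route(inference_logits, k=2)                   # experts used in the rollout
recomputed = route(training_logits, k=2)                 # experts the trainer picks alone
replayed = route(training_logits, k=2, replay=sampled)   # trainer forced to match rollout

print(sampled, recomputed, replayed)
```

In this toy case a difference in the fourth decimal place is enough to swap one expert, so the trainer would compute gradients through a different subnetwork than the one that produced the sampled tokens. Replaying the recorded choice removes that mismatch; matched kernels aim to make the drift smaller in the first place.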

Context

The source is a Sequoia Capital interview with Federico Cassano from Cursor and Dima from Fireworks about how Cursor trained Composer 2. Cursor is a developer tool company. Composer 2 is Cursor's coding model for long-horizon agent work inside Cursor. Fireworks provides infrastructure for high-throughput inference and reinforcement-learning rollouts. The discussion describes a two-stage training process: continued training on code-heavy data, then large-scale reinforcement learning on Cursor-like software-engineering tasks. The guests also discuss simulated environments, rewards, model-weight updates, distributed inference, and the role of the Cursor product harness.

Technical Need To Know

  • Reinforcement learning: A training method where a model attempts tasks, receives rewards, and is updated to make high-reward behavior more likely.
  • Rollout: One full model attempt inside an environment.
  • Harness: The product wrapper around the model: tools, commands, files, and execution flow.
  • Reward signal: The judgment used to score whether an attempt worked.
  • Simulated environment: A safe copy of the product setting for model experiments.
  • Mixture of Experts: A model design where only some expert subnetworks activate for each token.
  • Long-horizon agent: An agent that works across many steps and needs memory and credit assignment.
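
Several of the terms above fit together in one loop, sketched here under stated assumptions. `ToyEnv`, `rollout`, and `discounted_credit` are hypothetical names, the environment is a trivial counter rather than a real product harness, and discounting is used only as one simple, well-known way to spread an end-of-episode reward across steps; the source does not say which credit-assignment method Cursor uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyEnv:
    """Hypothetical simulated environment: reach `target` by adding actions to `state`."""
    target: int = 3
    state: int = 0
    max_steps: int = 5
    steps: int = 0

    def step(self, action: int) -> Tuple[int, bool]:
        self.state += action
        self.steps += 1
        done = self.state == self.target or self.steps >= self.max_steps
        return self.state, done

    def reward(self) -> float:
        # Reward signal: 1.0 only if the final state matches the target.
        return 1.0 if self.state == self.target else 0.0


def rollout(env: ToyEnv, policy: Callable[[int], int]) -> Tuple[List[int], float]:
    """One full attempt: act until the environment ends, then score the outcome."""
    actions: List[int] = []
    state, done = env.state, False
    while not done:
        action = policy(state)
        actions.append(action)
        state, done = env.step(action)
    return actions, env.reward()


def discounted_credit(num_steps: int, final_reward: float, gamma: float = 0.9) -> List[float]:
    """Spread one final reward back over the steps; steps nearer the end get more credit."""
    return [final_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


actions, final_reward = rollout(ToyEnv(), policy=lambda state: 1)
credit = discounted_credit(len(actions), final_reward)
print(actions, final_reward, credit)
```

The policy here always takes action 1, so the rollout ends after three steps with a reward of 1.0, and each earlier step receives a slightly smaller share of that reward. A long-horizon agent faces the same problem at much larger scale: hundreds of tool calls, one outcome, and the question of which steps deserved the credit.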

Nuanced Take

Application companies with enough usage and clear outcomes may train better, cheaper models for their own workflows. The moat comes from data, harness fidelity, and the ability to improve the model inside the product.