FT-Dojo: Autonomous LLM Fine-Tuning for Vertical Domains

May 21, 2026 · Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian

A new research project from Microsoft aims to turn the tedious process of fine-tuning large language models into something an AI agent can handle on its own. The team introduced FT-Dojo, a benchmark environment that sets up 13 fine-tuning tasks across five domains. Instead of just giving you another stack of static datasets, FT-Dojo standardizes the whole workflow. It provides a shared raw data repository, a sandboxed execution environment, a structured feedback loop, and a held-out evaluation procedure. The idea is to treat end-to-end LLM fine-tuning as an interactive agent task rather than a manual chore.

The researchers also built FT-Agent, a purpose-built autonomous framework that tries to do the job from start to finish. It uses structured iteration planning, a fail-fast validation mechanism, and multi-level feedback analysis to continuously refine data selection and training strategies. In tests, FT-Agent performed best on 10 of the 13 tasks, setting a strong initial baseline. The team also ran controlled comparisons against other frontier agents and open-source planning backbones to back up their results.
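
To make the plan, validate, train, and feed-back cycle concrete, here is a minimal Python sketch of what such a loop could look like. It is not FT-Agent's actual code: `Plan`, `Feedback`, `validate`, `run_loop`, and the toy proposer and scorer are all hypothetical names invented for illustration, and the "training" step is a stand-in function rather than a real fine-tuning run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Plan:
    """Hypothetical iteration plan: how much data to use and how to train."""
    data_fraction: float
    learning_rate: float


@dataclass
class Feedback:
    """Hypothetical record of what happened in one iteration."""
    plan: Plan
    passed_validation: bool
    score: Optional[float]  # None when the plan was rejected before training
    note: str


def validate(plan: Plan) -> Optional[str]:
    """Fail-fast check: return an error message, or None if the plan looks sane."""
    if not 0.0 < plan.data_fraction <= 1.0:
        return "data_fraction must be in (0, 1]"
    if plan.learning_rate <= 0.0:
        return "learning_rate must be positive"
    return None


def run_loop(
    propose: Callable[[List[Feedback]], Plan],
    train_and_eval: Callable[[Plan], float],
    iterations: int,
) -> List[Feedback]:
    """Propose a plan from past feedback, validate cheaply, then train and record."""
    history: List[Feedback] = []
    for _ in range(iterations):
        plan = propose(history)
        error = validate(plan)
        if error is not None:
            # Reject early so no expensive training run is spent on a bad plan.
            history.append(Feedback(plan, False, None, error))
            continue
        history.append(Feedback(plan, True, train_and_eval(plan), "ok"))
    return history


def best(history: List[Feedback]) -> Optional[Feedback]:
    """Return the highest-scoring completed iteration, if any."""
    scored = [f for f in history if f.score is not None]
    return max(scored, key=lambda f: f.score) if scored else None


def toy_propose(history: List[Feedback]) -> Plan:
    """Toy proposer: starts with an invalid plan, then reacts to feedback."""
    if not history:
        return Plan(data_fraction=1.5, learning_rate=1e-4)  # deliberately invalid
    last = history[-1]
    if not last.passed_validation:
        return Plan(data_fraction=1.0, learning_rate=last.plan.learning_rate)
    return Plan(last.plan.data_fraction / 2, last.plan.learning_rate)


def toy_train_and_eval(plan: Plan) -> float:
    """Stand-in for a real fine-tuning run; the scoring rule is made up."""
    return 1.0 - abs(plan.data_fraction - 0.5)


if __name__ == "__main__":
    for fb in run_loop(toy_propose, toy_train_and_eval, iterations=3):
        print(fb.plan, fb.passed_validation, fb.score, fb.note)
```

The point of the sketch is the ordering: the cheap validation step runs before the expensive step, and every outcome, including rejections, lands in the history that the next proposal reads. A real system would replace the toy proposer with an LLM-driven planner and the toy scorer with an actual training and held-out evaluation run.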

One interesting finding from the case studies: the agents could learn from their own failures over time, showing some ability to recover from mistakes through cumulative learning. But they also struggled with causal diagnosis and long-horizon planning. So the system is promising but far from perfect. The implementation is available on GitHub for anyone to dig into.

What makes this notable is that it moves beyond the usual approach of building better models or bigger datasets. Instead, it treats the entire fine-tuning pipeline as something an agent can learn to improve on its own. That shift could eventually save practitioners a lot of manual trial and error, though the current limitations suggest we are still in the early days.
