FT-Agent Automates LLM Fine-Tuning Across 13 Tasks

May 22, 2026 · Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian

A team of researchers has built a new benchmark called FT-Dojo to test whether AI agents can handle the tedious process of fine-tuning large language models on their own. Right now, getting a model ready for a specific domain like medicine or law involves a lot of manual labor: curating datasets, tweaking training parameters, and repeatedly checking whether the model is actually improving. The researchers wanted to see if an autonomous agent could manage that whole pipeline end to end.

FT-Dojo isn't just another pile of static datasets. It's a full interactive environment with 13 tasks spread across 5 domains. It includes a shared raw data repository, a sandbox for running training code, structured feedback mechanisms, and a held-out evaluation process. The idea is to give agents a standardized playground where they can experiment, fail, and learn from their mistakes. The team also built their own agent, FT-Agent, which uses structured iteration planning, fail-fast validation, and multi-level feedback analysis to refine its data choices and training strategies.
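To make the plan, validate, and feedback cycle concrete, here is a minimal Python sketch of what such a loop might look like. It is not FT-Agent's actual code: every name in it (Plan, History, fail_fast_check, run_loop, and the toy propose and scoring functions) is hypothetical, and the "training" step is a stand-in formula, not a real fine-tuning run. It only illustrates the general idea of rejecting bad plans cheaply before spending compute and keeping a cumulative record that later plans can draw on.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Plan:
    """One proposed fine-tuning attempt (hypothetical structure)."""
    data_fraction: float  # share of the raw data pool to train on
    learning_rate: float


@dataclass
class History:
    """Cumulative record of attempts that later plans can build on."""
    attempts: list = field(default_factory=list)  # (plan, score, error) tuples
    best_score: float = float("-inf")
    best_plan: Optional[Plan] = None


def fail_fast_check(plan: Plan) -> Optional[str]:
    """Cheap sanity check run before any expensive training.

    Returns an error message, or None if the plan looks sane.
    """
    if not 0.0 < plan.data_fraction <= 1.0:
        return "data_fraction must be in (0, 1]"
    if not 0.0 < plan.learning_rate < 1.0:
        return "learning_rate must be in (0, 1)"
    return None


def run_loop(
    propose: Callable[[History], Plan],
    train_and_eval: Callable[[Plan], float],
    max_iters: int = 5,
) -> History:
    """Plan, validate, train, and record feedback for a fixed number of rounds."""
    history = History()
    for _ in range(max_iters):
        plan = propose(history)
        error = fail_fast_check(plan)
        if error is not None:
            # Feedback level 1: the plan was rejected before training started.
            history.attempts.append((plan, None, error))
            continue
        # Feedback level 2: the score from the (stand-in) evaluation.
        score = train_and_eval(plan)
        history.attempts.append((plan, score, None))
        if score > history.best_score:
            history.best_score, history.best_plan = score, plan
    return history


def toy_propose(history: History) -> Plan:
    """Toy planner: starts with an invalid plan, then shrinks the learning rate."""
    if not history.attempts:
        return Plan(data_fraction=1.5, learning_rate=1e-4)  # deliberately invalid
    n = len(history.attempts)
    return Plan(data_fraction=0.5, learning_rate=1e-3 / n)


def toy_train_and_eval(plan: Plan) -> float:
    """Stand-in for training plus held-out evaluation; peaks at lr = 5e-4."""
    return 1.0 - abs(plan.learning_rate - 5e-4) * 1000


history = run_loop(toy_propose, toy_train_and_eval, max_iters=4)
```

In this toy run the first plan is rejected by the cheap check without any training, and the later attempts are scored and compared so the best plan so far is always available to the planner. A real agent would replace the toy functions with an LLM-driven planner and actual training jobs in the sandbox.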

The results are promising. FT-Agent outperformed other agents on 10 of the 13 tasks, and the researchers ran controlled comparisons against frontier models and open-source planning backbones to back up their claims. Case studies showed the agent could recover from failures by learning cumulatively, but it still struggled with diagnosing root causes and making long-term plans.

What matters here is the shift toward treating model fine-tuning as an interactive agent problem, not just a data engineering chore. If these agents keep improving, they could take a lot of the grunt work out of adapting LLMs for specialized uses. The code is already up on GitHub for anyone to try.