Don’t lie to your friends: Learning what you know from collaborative self-play
Research questions
The paper asks whether a language model can learn better meta-knowledge about what it knows by training in a multi-agent setting rather than through direct supervised examples. More specifically, it tests whether collaborative self-play can teach a model when to answer from parametric knowledge, when to use a retrieval tool, and when to hedge because it is uncertain. It also asks whether those skills learned in a multi-agent game transfer to a single-agent deployment setting. 
Method
The authors use Gemma2-9B in a three-agent setup with one asker and two helper agents, where the helpers have different retrieval tools: one searches Wikipedia and the other searches PubMed. They train with Reinforced Self-Training on rollouts from this collaborative environment, using an effort-penalized F1 reward so that the system is rewarded for correct answers while being slightly penalized for unnecessary tool calls. Evaluation is then done in a single-agent setting on BioASQ and PopQA, with out-of-distribution tests on Natural Questions and EntityQuestions, and comparisons against in-context learning, an action-supervised baseline, and a deanonymized variant of collaborative self-play. 
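As a rough illustration of the effort-penalized F1 reward described above, the sketch below computes a token-level F1 score and subtracts a small cost per tool call. It is not the authors' implementation: the function names, the whitespace tokenization, the penalty value of 0.05, and the reward threshold used to keep rollouts for fine-tuning are all assumptions made for this example.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def effort_penalized_reward(
    prediction: str,
    reference: str,
    num_tool_calls: int,
    penalty_per_call: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """F1 minus a small cost for every tool call made during the rollout."""
    return token_f1(prediction, reference) - penalty_per_call * num_tool_calls


def select_rollouts(rollouts: list[dict], min_reward: float = 0.5) -> list[dict]:
    """Keep rollouts whose reward clears a threshold (hypothetical filter step).

    Each rollout is assumed to be a dict with the keys
    'prediction', 'reference', and 'num_tool_calls'.
    """
    return [
        r for r in rollouts
        if effort_penalized_reward(
            r["prediction"], r["reference"], r["num_tool_calls"]
        ) >= min_reward
    ]
```

Under these assumptions, a correct answer produced with no searches scores higher than the same answer produced after two searches, which is the pressure that discourages unnecessary tool calls.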
Findings
The main result is that collaborative self-play matches or improves task F1 relative to in-context learning while using about 2 to 5 times fewer search calls on the in-distribution datasets. It also generalizes better than the action-supervised baseline in the out-of-distribution setting, where the action-supervised model searches too little and loses performance. On calibration, collaborative self-play brings the model's answer-versus-hedge behavior closer in line with its actual parametric accuracy, although the directly action-supervised method does best on in-distribution calibration and the deanonymized version performs substantially worse.
Limitations
The paper explicitly says that it only studies retrieval tools, factoid question answering tasks, and a single language model. The authors note that they have not yet tested richer tools such as calculators or code interpreters, more complex interactions such as question decomposition, or tasks beyond factoid QA where there may be no clean ground-truth validator. They also identify clarification and other grounding behaviors as longer-term directions that are not covered in the present experiments. 
Methodological failures
The biggest methodological weakness is external validity. The whole setup is engineered around a toy three-agent environment with highly structured actions like #ASK, #SEARCH, #ANSWER, and #HEDGE, so it is not yet clear how well the gains would survive in more natural conversational settings or in tasks where retrieval quality, uncertainty, and collaboration are messier. A second weakness is that the strongest claims depend on a very specific benchmark mix chosen to satisfy the paper’s theoretical conditions, including complementary tools and long-tail factoid questions, so the positive results may partly reflect favorable task design rather than a broadly robust training principle. There is also a comparison problem that resembles evaluation leakage: the action-supervised baseline is tuned to the same answer-versus-hedge distinction used in evaluation, which makes parts of the comparison a less clean test of general capability.