Day 93

Beyond the Black Box: Interpretability of Agentic AI Tool Use

May 07, 2026

Research questions. The paper asks whether agentic AI systems contain internal signals, before they act, that indicate whether a tool call is needed and how risky that tool action may be. It also asks whether these signals can reveal missed or unnecessary tool calls better than output-only monitoring.

Methodology. The authors build an interpretability toolkit using Sparse Autoencoders and linear probes on pre-action model activations. They train and test Tool-Need and Tool-Risk probes on multi-step NVIDIA Nemotron function-calling trajectories, then evaluate GPT-OSS 20B and Gemma 3 27B, including zero-shot transfer to BFCL.
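
Illustration. To make the probing step concrete, here is a minimal sketch of a linear probe: a logistic-regression classifier trained to read a binary label from activation vectors. It is not the authors' code. The activations are synthetic stand-ins generated in the script, and the function names, the dimensionality (16), the learning rate, and the train/test split are all assumptions chosen for illustration. In a real setup the vectors would be pre-action activations extracted from a model layer, and the label would be whether a tool call was needed.

```python
import math
import random


def make_synthetic_activations(n, dim, seed=0):
    """Hypothetical stand-in for pre-action activations.

    Each example is (vector, label), where label 1 means "tool needed".
    The two classes are shifted along one hidden direction plus noise.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label else -1.0
        x = [rng.gauss(0, 1) + sign * 0.5 * d for d in direction]
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    # Linear readout: the probe is just a weight vector and a bias.
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(data, dim, lr=0.1, epochs=50):
    """Fit logistic regression with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = sigmoid(score(w, b, x)) - y  # d(log-loss)/d(score)
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def accuracy(data, w, b):
    # Predict "tool needed" when the linear score is non-negative.
    correct = sum((score(w, b, x) >= 0) == bool(y) for x, y in data)
    return correct / len(data)


DIM = 16  # assumed toy dimensionality; real activations are far larger
data = make_synthetic_activations(400, DIM)
train, test = data[:300], data[300:]
w, b = train_probe(train, DIM)
probe_accuracy = accuracy(test, w, b)
print(f"held-out probe accuracy: {probe_accuracy:.3f}")
```

Because the probe is linear, high held-out accuracy suggests the label is linearly readable from the activations, which is the sense in which such a signal counts as "recoverable". Repeating the same fit on activations from different layers is one simple way to compare where that signal is strongest.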

Findings. The study finds that tool-decision signals can be recovered from internal activations, especially in later layers. GPT-OSS reached 75.3 percent Tool-Need accuracy and Gemma reached 71.4 percent, while Tool-Risk accuracy on held-out Nemotron tool rows was 90.3 percent and 88.5 percent, respectively.

Why it matters. Agents increasingly act through tools, where mistakes can have real downstream consequences before a human sees the final output. The paper suggests that mechanistic interpretability could support real-time internal monitoring of tool use, risk, and failure modes in agent systems.
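
Illustration. As a rough sketch of what such internal monitoring might look like, the snippet below compares hypothetical probe outputs against what the agent actually did at each step and flags disagreements. Everything here is an assumption made for illustration, not something taken from the paper: the `Step` fields, the `flag_step` function, the threshold values, and the flag wording.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool_need_prob: float  # assumed Tool-Need probe output in [0, 1]
    tool_risk_prob: float  # assumed Tool-Risk probe output in [0, 1]
    called_tool: bool      # what the agent actually did at this step


def flag_step(step, need_hi=0.8, need_lo=0.2, risk_hi=0.7):
    """Return warnings where probe signals disagree with the action taken.

    The thresholds are hypothetical and would need calibration on real data.
    """
    flags = []
    if step.tool_need_prob >= need_hi and not step.called_tool:
        flags.append("possible missed tool call")
    if step.tool_need_prob <= need_lo and step.called_tool:
        flags.append("possible unnecessary tool call")
    if step.called_tool and step.tool_risk_prob >= risk_hi:
        flags.append("high-risk tool action")
    return flags


trajectory = [
    Step(tool_need_prob=0.92, tool_risk_prob=0.10, called_tool=False),
    Step(tool_need_prob=0.05, tool_risk_prob=0.15, called_tool=True),
    Step(tool_need_prob=0.88, tool_risk_prob=0.91, called_tool=True),
    Step(tool_need_prob=0.50, tool_risk_prob=0.30, called_tool=False),
]

for i, step in enumerate(trajectory):
    print(i, flag_step(step))
```

The point of the sketch is the comparison itself: an output-only monitor sees only `called_tool`, whereas a probe-based monitor can also see when the internal signal and the action diverge.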
