Preprint · 2026 · Agentic Video Understanding

Confidence-Aware Tool Orchestration
for Robust Video Understanding

¹NTU Singapore  ·  ²University of Minnesota–Twin Cities  ·  ³UNIST
✉ Correspondence: jaehong.yoon@ntu.edu.sg

Figure 1. The Robust-TO framework. A frozen, parameter-free profiler scores every frame for blur, brightness, and occlusion and keeps only a reliable subset (Quality-Assured Reliable Frame Selection). The query q is decomposed into sub-queries that a disturbance-aware Router dispatches to cost-annotated tools — track_temporal, detect_objects, read_text, … — each returning a unified (result, confidence, source) triple. Confidence-Weighted Evidence Synthesis then sorts that evidence into High / Mid / Low tiers to compose a calibrated answer. The policy is trained end-to-end with GRPO under a combined sub-query-efficiency + confidence-cost reward.


Abstract

Threading trustworthiness through every reasoning step

Robust-TO is an agentic framework that makes per-frame trustworthiness a first-class signal across the entire video reasoning pipeline. A unified (result, confidence) interface couples each tool's intrinsic certainty with a parameter-free disturbance estimate, enabling reliability-guided frame selection, disturbance-aware tool routing, and three-tier evidence synthesis.

Optimized with a confidence-cost GRPO reward that jointly balances correctness, evidence reliability, and compute, Robust-TO delivers +10.6 average accuracy over the strongest open-source baseline, the smallest clean-to-corrupted accuracy drop among all compared methods, and <5% latency overhead on clean inputs — degrading gracefully rather than failing silently.

TL;DR

Video-LLMs trust every frame equally and never report when they shouldn't. Robust-TO teaches an agent to know what it cannot see, route around it, and answer with calibrated confidence — turning catastrophic failure under real-world corruption into graceful, measurable degradation.

Avg. accuracy gain vs. strongest OSS: +10.6
Clean→corrupted drop (Δ): 3.0, smallest of all methods
Latency overhead on clean inputs: <5%
Tasks: 8, across 2 benchmarks

Motivation

Why frontier Video-LLMs fail silently

Embodied agents see motion blur, glare, occlusion, low light, and sensor noise constantly. Under these conditions, current Video-LLMs lose 15–30 accuracy points on embodied benchmarks — but their internal confidence does not move. A downstream planner has no signal that the perception it depends on has quietly become unreliable.

Existing robustness work hardens a single monolithic model. Robust-TO takes the opposite view: keep the tools, but make the agent reason about their trustworthiness. Every observation carries an explicit reliability estimate, and that estimate shapes which frames are read, which tools are called, and how evidence is fused into an answer. Figure 2 traces this loop on a single corrupted flight.


Figure 2. Motivating workflow. Given a degraded drone flight and the query “What major landmarks does the drone pass, and in what order?”, Robust-TO first assesses per-frame quality, keeps only the top-k trustworthy frames, routes each sub-query to the tool best suited to it, and fuses tiered evidence (High / Mid / Low) into an answer it can label HIGH confidence — rather than trusting every corrupted frame equally. This confidence-aware loop is precisely what converts silent failure into graceful, measurable degradation.


Method

The Robust-TO pipeline

A single trustworthiness signal flows from raw frames to the final answer. Each stage consumes the previous stage's confidence and emits its own — the agent never reasons over an observation without knowing how much to trust it.
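
To make the hand-off between stages concrete, here is a minimal sketch of the unified (result, confidence, source) triple named in Figure 1. The `ToolOutput` class, the `effective_confidence` helper, and the multiplicative discount are illustrative assumptions; the page only states that each tool's intrinsic certainty is coupled with a disturbance estimate, not how.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolOutput:
    """Hypothetical container for the unified triple every tool returns."""
    result: str        # the tool's answer to its sub-query
    confidence: float  # confidence in [0, 1] attached to that answer
    source: str        # name of the tool that produced it


def effective_confidence(tool_conf: float, disturbance: float) -> float:
    """Discount a tool's intrinsic certainty by the estimated disturbance.

    Both inputs are assumed to lie in [0, 1]. The product form is an
    assumption made for illustration only.
    """
    if not (0.0 <= tool_conf <= 1.0 and 0.0 <= disturbance <= 1.0):
        raise ValueError("tool_conf and disturbance must lie in [0, 1]")
    return tool_conf * (1.0 - disturbance)


def wrap(result: str, tool_conf: float, disturbance: float, source: str) -> ToolOutput:
    """Package a raw tool answer so downstream stages always see a confidence."""
    return ToolOutput(result, effective_confidence(tool_conf, disturbance), source)
```

Under this sketch, a detector that is 80% sure of its output on a frame with an estimated disturbance of 0.5 would hand a confidence of 0.4 to the next stage.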

The complete module-level architecture — frame profiling, the disturbance-aware router, three-tier synthesis, and the GRPO reward — is given in Figure 1. Below we distill the four design principles that make that flow robust.

Core mechanisms

Four design principles

Principle 01

Reliability is relative

Frames are scored by reliability × relevance; corrupted-but-irrelevant frames and clean-but-uninformative frames are both pruned before any tool fires.
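
The selection rule can be sketched in a few lines. This is a simplified illustration that assumes per-frame reliability and relevance scores in [0, 1] are already available; the `select_frames` name and the plain top-k cut are assumptions, not the paper's exact procedure.

```python
from typing import List, Sequence


def select_frames(reliability: Sequence[float],
                  relevance: Sequence[float],
                  k: int) -> List[int]:
    """Return indices of the k frames with the highest reliability * relevance.

    Indices are returned in temporal order so that downstream tools still
    see the frames in sequence.
    """
    if len(reliability) != len(relevance):
        raise ValueError("reliability and relevance must have equal length")
    if k < 0:
        raise ValueError("k must be non-negative")
    scores = [r * q for r, q in zip(reliability, relevance)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Frame 0 is clean but uninformative, frame 1 is relevant but corrupted.
kept = select_frames([0.9, 0.2, 0.95, 0.8], [0.1, 0.9, 0.7, 0.6], k=2)
```

Because the score is a product, a frame needs to be both trustworthy and useful to survive; a high value on one axis cannot compensate for a near-zero value on the other.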

Principle 02

Route to the robust tool

Blur favors caption_frame; occlusion favors recognize_action; low light favors enhanced detect_objects — corruption type selects the tool, not the other way around.
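
A minimal sketch of this routing rule follows. The three corruption-to-tool pairs come from the text above; the dictionary keys, the `route` function, the 0.5 severity threshold, and the fallback to a default tool are assumptions for illustration. The "enhanced" variant of detect_objects is represented only by the tool name, since the enhancement step is not described here.

```python
from typing import Dict

# Corruption type -> tool the text describes as more robust to it.
ROBUST_TOOL: Dict[str, str] = {
    "motion_blur": "caption_frame",
    "occlusion": "recognize_action",
    "low_light": "detect_objects",  # the text specifies an enhanced variant
}


def route(disturbances: Dict[str, float],
          default_tool: str,
          threshold: float = 0.5) -> str:
    """Pick a tool from the dominant disturbance; otherwise keep the default.

    `disturbances` maps a corruption name to an estimated severity in [0, 1].
    """
    if not disturbances:
        return default_tool
    dominant, severity = max(disturbances.items(), key=lambda kv: kv[1])
    if severity < threshold:
        return default_tool  # frame is clean enough for the default route
    return ROBUST_TOOL.get(dominant, default_tool)
```

For example, `route({"motion_blur": 0.8, "occlusion": 0.1}, "detect_objects")` would select `caption_frame`, while a mostly clean frame keeps its default tool.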

Principle 03

Doubt is structured

Contradictory Mid-tier facts are discarded rather than averaged, so the answer is built only from the evidence that actually holds up.
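
As a rough illustration of tiered synthesis, the sketch below sorts evidence into three tiers and drops middle-tier facts that disagree with each other. The tier thresholds (0.8 and 0.5), the `Evidence` tuple layout, and the choice to ignore low-tier evidence entirely are assumptions; only the discard-on-contradiction rule is taken from the text.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (what is being asked, claimed value, confidence in [0, 1])
Evidence = Tuple[str, str, float]


def tier(conf: float, hi: float = 0.8, lo: float = 0.5) -> str:
    """Map a confidence to a tier label (thresholds are assumed)."""
    if conf >= hi:
        return "HIGH"
    return "MID" if conf >= lo else "LOW"


def synthesize(evidence: List[Evidence]) -> Dict[str, str]:
    """Keep high-tier facts; keep mid-tier facts only if uncontradicted."""
    answer: Dict[str, str] = {}
    mid: Dict[str, Set[str]] = defaultdict(set)
    for key, value, conf in evidence:
        label = tier(conf)
        if label == "HIGH":
            answer.setdefault(key, value)  # first high-tier claim wins
        elif label == "MID":
            mid[key].add(value)
        # low-tier evidence is ignored in this sketch
    for key, values in mid.items():
        if key not in answer and len(values) == 1:
            answer[key] = next(iter(values))  # a single consistent claim
    return answer
```

Discarding rather than averaging matters for categorical facts: two mid-confidence claims of "tower" and "crane" have no meaningful mean, so the sketch leaves that slot unanswered.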

Principle 04

Confidence can't be gamed

A frozen N* disturbance estimator anchors the GRPO reward — the agent is rewarded for being right and calibrated, never for bluffing.
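
One way to picture this anchoring is sketched below. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage; the `reward` function itself, its over-claim penalty, and the weights `lam_conf` and `lam_cost` are assumptions made for illustration and are not the paper's reward.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def reward(correct: bool,
           stated_conf: float,
           anchor_reliability: float,
           cost: float,
           lam_conf: float = 0.5,
           lam_cost: float = 0.1) -> float:
    """Correctness minus an over-claim penalty minus a compute cost.

    `anchor_reliability` stands in for the frozen disturbance estimator's
    output. The policy cannot change it, so claiming more confidence than
    the frames support always costs reward.
    """
    overclaim = max(0.0, stated_conf - anchor_reliability)
    return float(correct) - lam_conf * overclaim - lam_cost * cost


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


honest = reward(True, stated_conf=0.9, anchor_reliability=0.9, cost=1.0)
bluff = reward(True, stated_conf=0.9, anchor_reliability=0.4, cost=1.0)
```

In this sketch two rollouts with the same correct answer and the same stated confidence receive different rewards once the frozen anchor disagrees with one of them, which is the sense in which confidence cannot be gamed.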


Interactive

Watch trustworthiness drive the agent

Apply real-world corruptions to an embodied navigation frame and observe how Robust-TO re-estimates reliability, re-routes to a robust tool, and adjusts its evidence tier — while a blind baseline's confidence stays flat. Toggle corruptions and drag the intensity.

[Interactive demo, shown in its default state. Input: clean, default route. Corruption toggles: Motion Blur, Glare, Occlusion, Low-Light, Gaussian Noise; intensity slider at 55%. Per-frame reliability: 0.96. Blind baseline confidence: 0.93. Evidence tier: HIGH, routed to detect_objects. Robust-TO answer: EXIT, driven by HIGH-confidence evidence.]
The contrast that matters: the gray baseline confidence barely moves as the frame degrades, while Robust-TO's reliability (and its evidence tier) tracks the actual damage — the difference between a planner that knows it's blind and one that doesn't.

Experiments

State-of-the-art robustness and accuracy

Evaluated on UrbanVideo-Bench (outdoor embodied: LP, CF, PE, AG) and VSI-Bench (indoor spatial: RDist, RDir, RP, AO) under clean and RoVA-V1 corrupted variants: Motion Blur, Glare, Occlusion, Low-Light, and Gaussian Noise. In the original tables, the best result per column is set in bold and our full model is marked.

Table 1 · Main results: 8 embodied spatial-reasoning tasks
Table 1: per-task accuracy across UrbanVideo-Bench and VSI-Bench. Robust-TO + Qwen3-VL-7B reaches 56.4 average, +10.6 over the strongest fine-tuned baseline, using only 20.7 frames.

Reading Table 1. Across all eight UrbanVideo-Bench and VSI-Bench tasks, Robust-TO + Qwen3-VL-7B reaches 56.4 average accuracy. That is +10.6 over the strongest open-source / fine-tuned baseline (Qwen2.5-VL-7B SFT at 45.8) and ahead of every proprietary and SOTA-reasoning model, including Gemini-2.5-Pro (46.2). It posts the best score on six of the eight tasks (e.g. Appearance Order 77.5, Landmark Position 61.1, Action Generation 59.0) while reading only 20.7 frames on average versus 32 for the baselines: higher accuracy at lower cost.
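
The headline figures in this paragraph follow from simple arithmetic on the reported numbers. A quick check, using only values quoted above (the variable names are ours):

```python
# Average accuracy over the eight tasks, as reported for Table 1.
robust_to_avg = 56.4
strongest_oss_avg = 45.8   # Qwen2.5-VL-7B SFT
gemini_avg = 46.2          # Gemini-2.5-Pro

gain_over_oss = round(robust_to_avg - strongest_oss_avg, 1)
gain_over_gemini = round(robust_to_avg - gemini_avg, 1)

# Average number of frames read per video.
frames_robust_to = 20.7
frames_baseline = 32
frame_reduction_pct = round(
    100 * (frames_baseline - frames_robust_to) / frames_baseline, 1
)

print(f"gain over strongest OSS baseline: +{gain_over_oss}")
print(f"gain over Gemini-2.5-Pro:         +{gain_over_gemini}")
print(f"frames read:                      -{frame_reduction_pct}%")
```

The frame reduction works out to roughly 35%, which matches the frame-count figure quoted elsewhere on this page.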

Table 2 · Accuracy (%) under RoVA-V1 corruptions, UrbanVideo-Bench

Method                       Motion Blur   Glare   Occlusion   Low-Light   Gaussian Noise   Avg.
GPT-4o                       34.0          32.0    30.0        34.5        30.5             32.2
Gemini-2.5-Pro               39.0          37.8    35.2        40.5        38.0             38.1
Video-R1 + Qwen3-VL-7B       49.5          48.0    45.8        50.2        49.0             48.5
Robust-TO + Qwen2.5-VL-7B    48.0          46.8    44.5        48.5        47.7             47.1
Robust-TO + Qwen3-VL-7B      55.2          54.0    51.8        56.0        54.5             54.3

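
The Avg. column of Table 2 can be re-derived from the five per-corruption scores, which is a useful sanity check when copying numbers out of the table. The snippet below uses only the values listed in Table 2; the dictionary layout and variable names are ours.

```python
# Per-corruption accuracy (%): Motion Blur, Glare, Occlusion, Low-Light, Gaussian Noise.
TABLE2 = {
    "GPT-4o": [34.0, 32.0, 30.0, 34.5, 30.5],
    "Gemini-2.5-Pro": [39.0, 37.8, 35.2, 40.5, 38.0],
    "Video-R1 + Qwen3-VL-7B": [49.5, 48.0, 45.8, 50.2, 49.0],
    "Robust-TO + Qwen2.5-VL-7B": [48.0, 46.8, 44.5, 48.5, 47.7],
    "Robust-TO + Qwen3-VL-7B": [55.2, 54.0, 51.8, 56.0, 54.5],
}

# Row means, rounded to one decimal as in the table.
averages = {m: round(sum(v) / len(v), 1) for m, v in TABLE2.items()}

# Which method wins each corruption column?
methods = list(TABLE2)
column_winners = [
    max(methods, key=lambda m: TABLE2[m][col]) for col in range(5)
]

# Margin of the full model over the next-best row average.
full = "Robust-TO + Qwen3-VL-7B"
runner_up = max(a for m, a in averages.items() if m != full)
margin = round(averages[full] - runner_up, 1)

for method, avg in averages.items():
    print(f"{method:28s} {avg:.1f}")
```

Every recomputed mean agrees with the reported Avg. column, and the full model leads each of the five corruption columns.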
Graceful degradation

Smallest drop, fewer frames, lower latency

Clean → corrupted drop (Δ): 3.0 for Robust-TO + Qwen3, vs. 3.5 for Video-R1 and 6.2 for Gemini
Frames processed: −35% (reliability × relevance selection prunes redundant frames)
Inference time: −35% (<5% overhead added on clean inputs)

Robust-TO + Qwen3-VL-7B posts the best accuracy under every corruption type in Table 2 while exhibiting the smallest clean-to-corrupted accuracy drop (Δ = 3.0) among all compared methods. Crucially, robustness does not cost compute: by reading 35% fewer frames it cuts inference time by over 35% and adds under 5% overhead on clean inputs, so graceful degradation is essentially free.


Manuscript

Read the full paper

The preprint includes all derivations, the full ablation suite, per-tool implementation details, and qualitative case studies under concurrent corruptions.

Robust-TO.pdf

Reference

BibTeX

If you find Robust-TO useful for your research, please consider citing:

@article{he2026robustto,
  title   = {Confidence-Aware Tool Orchestration for Robust Video Understanding},
  author  = {He, Yangfan and Choi, Yujin and Yoon, Jaehong},
  year    = {2026},
  journal = {arXiv preprint arXiv:2606.26904},
  url     = {https://arxiv.org/abs/2606.26904},
}