Reinforcement Learning (RL) has shown great potential in refining robotic manipulation policies, yet its efficacy remains strongly bottlenecked by the difficulty of designing generalizable reward functions. In this paper, we propose a framework for online policy refinement that adapts foundation VLMs into online reward generators. We develop a robust, scalable reward model based on a state-of-the-art VLM, trained on a large-scale, multi-source dataset encompassing real-world robot trajectories, human-object interactions, and diverse simulated environments. Unlike prior approaches that evaluate entire trajectories post-hoc, our method leverages the VLM to formulate a multifaceted reward signal comprising process, completion, and temporal contrastive rewards based on current visual observations. Starting from a base policy trained via Imitation Learning (IL), we employ these VLM rewards to guide the model to correct sub-optimal behaviors in a closed-loop manner. We evaluate our framework on challenging long-horizon manipulation benchmarks requiring sequential execution and precise control. Crucially, our reward model operates in a purely zero-shot manner within these test environments. Experimental results show that our method significantly improves the success rate of the initial IL policy within just 30 RL iterations, demonstrating remarkable sample efficiency. This evidence highlights that VLM-generated signals can provide reliable feedback to resolve execution errors, effectively eliminating the need for manual reward engineering and facilitating efficient online refinement for robot learning.
Stage 1: Multi-source Trajectory Processing. Diverse video trajectories from real-robot corpora, human-object interactions, and simulated benchmarks are curated and processed to establish a comprehensive dataset. This heterogeneous data pool provides extensive coverage of varied interaction patterns and complex task dynamics, enabling the model to extract the generalizable temporal features required for robust and accurate reward estimation.
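The curation step above can be sketched as a unification of heterogeneous trajectories into one record format with per-frame progress labels. This is a minimal illustration, not the paper's pipeline; the record fields and the linear progress-labeling heuristic (frame index over last index for successful episodes) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical unified record for one frame of a trajectory;
# field names are illustrative, not taken from the paper.
@dataclass
class Frame:
    image_path: str       # visual observation at this timestep
    progress: float = 0.0 # normalized task progress in [0, 1]

@dataclass
class Trajectory:
    source: str           # "real_robot" | "human_object" | "simulation"
    task_description: str
    frames: List[Frame] = field(default_factory=list)

def annotate_progress(traj: Trajectory) -> Trajectory:
    """Assign a linear progress label to each frame of a successful
    trajectory (frame index / last index), a common heuristic for
    progress supervision."""
    n = len(traj.frames)
    for i, f in enumerate(traj.frames):
        f.progress = i / (n - 1) if n > 1 else 1.0
    return traj
```

A single labeling convention across real-robot, human, and simulated sources is what lets one reward model learn from all of them jointly.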
Stage 2: Training Large Reward Models (LRMs) for reward generation. A Qwen3-VL-8B-Instruct backbone is fine-tuned via LoRA to transform the general-purpose VLM into a specialized Large Reward Model (LRM). This specialization stage yields three independent reward modalities: the Temporal Contrastive Reward (\(r_{\text{cont}}\)) for relative ranking, the Absolute Progress Reward (\(r_{\text{prog}}\)) for continuous estimation, and the Task Completion Reward (\(r_{\text{comp}}\)) for terminal state anchoring.
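Because the LRM is a fine-tuned VLM, its three reward signals arrive as generated text and must be parsed into numbers. The sketch below assumes a JSON answer format with keys `r_cont`, `r_prog`, and `r_comp`; this format and the clamping ranges are illustrative assumptions, not the paper's exact protocol.

```python
import json
import re

# Hypothetical helper: the fine-tuned LRM is assumed to answer reward
# queries with a small JSON object; the key names are illustrative.
def parse_lrm_rewards(response: str) -> dict:
    """Extract the three reward signals from an LRM text response.

    Assumed form:  {"r_cont": 1, "r_prog": 0.62, "r_comp": 0}
      r_cont in {-1, 0, 1}: relative ranking of the new frame
      r_prog in [0, 1]:     absolute progress estimate
      r_comp in {0, 1}:     terminal success indicator
    """
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in LRM response")
    r = json.loads(match.group(0))
    # Clamp to valid ranges so a noisy generation cannot destabilize RL.
    r["r_prog"] = min(1.0, max(0.0, float(r["r_prog"])))
    r["r_cont"] = max(-1, min(1, int(r["r_cont"])))
    r["r_comp"] = 1 if int(r["r_comp"]) == 1 else 0
    return r
```

Clamping at the parsing boundary is a cheap safeguard: even if the VLM occasionally generates an out-of-range value, the downstream RL update only ever sees well-bounded rewards.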
Stage 3: RL with LRM-generated rewards. During active interaction, the specialized LRM functions as a high-fidelity feedback engine, mapping visual observations \(I_t\) and task descriptions \(d\) into a dense reward stream. The policy \(\pi_\phi\) utilizes these physically-grounded supervisory signals to autonomously refine its control behaviors, enabling stable and high-precision robotic manipulation.
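One way the three LRM signals could be fused into the scalar per-step reward consumed by \(\pi_\phi\) is a weighted sum in which progress enters as a difference between consecutive steps (potential-based shaping). The weights and the shaping choice below are assumptions for illustration, not values from the paper.

```python
# Hypothetical weighted combination of the three LRM signals into the
# scalar reward used by the policy update; the weights are illustrative.
def combine_rewards(r_cont: int, r_prog_t: float, r_prog_prev: float,
                    r_comp: int, w_cont: float = 0.2,
                    w_prog: float = 0.6, w_comp: float = 1.0) -> float:
    """Dense per-step reward from LRM outputs.

    Progress enters as a difference (potential-based shaping), so the
    agent is rewarded for making progress rather than for lingering at
    an already high-progress state.
    """
    return (w_cont * r_cont
            + w_prog * (r_prog_t - r_prog_prev)
            + w_comp * r_comp)
```

With shaping on the progress term, an episode's total progress reward telescopes to its net progress, which keeps the signal dense without biasing the optimal policy.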
(Figure legend: SFT Baseline vs. RL fine-tuning with LRM-generated rewards)
In real-world tasks, the SFT baseline (38.3% success) fails due to execution errors or distributional shifts. Our LRM-refined policy successfully corrects these failures, boosting the real-world success rate to 51.7%. These results consistently validate the effectiveness of our specialized LRM as a robust reward generation model for robotic reinforcement learning.
TBD