VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving

1 University of Wisconsin-Madison, 2 Purdue University
Indicates Equal Contribution, *Corresponding Author

Demonstration videos: Routes 1–10 (Town 02)

To the best of our knowledge, VLM-RL is the first work in the autonomous driving field to unify VLMs with RL for end-to-end driving policy learning in the CARLA simulator.

Abstract

In recent years, reinforcement learning (RL)-based methods for learning driving policies have gained increasing attention in the autonomous driving community and have achieved remarkable progress in various driving scenarios. However, traditional RL approaches rely on manually engineered rewards, which require extensive human effort and often lack generalizability. To address these limitations, we propose VLM-RL, a unified framework that integrates pre-trained Vision-Language Models (VLMs) with RL to generate reward signals using image observations and natural language goals. The core of VLM-RL is the contrasting language goal (CLG)-as-reward paradigm, which uses positive and negative language goals to generate semantic rewards. We further introduce a hierarchical reward synthesis approach that combines CLG-based semantic rewards with vehicle state information, improving reward stability and offering a more comprehensive reward signal. Additionally, a batch-processing technique is employed to optimize computational efficiency during training. Extensive experiments in the CARLA simulator demonstrate that VLM-RL outperforms state-of-the-art baselines, achieving a 10.5% reduction in collision rate, a 104.6% increase in route completion rate, and robust generalization to unseen driving scenarios. Furthermore, VLM-RL can be seamlessly integrated with almost any standard RL algorithm, potentially revolutionizing the existing RL paradigm that relies on manual reward engineering and enabling continuous performance improvements.
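As an informal sketch of the reward structure described above (the notation below is ours and not necessarily the paper's exact formulation), a VLM image encoder φ_I and text encoder φ_T score the current observation o_t against the positive goal g⁺ and the negative goal g⁻, and the resulting semantic term is then fused with a vehicle-state term:

```latex
% Informal sketch; the symbols and the weighted-sum form are assumptions.
r_t^{\mathrm{sem}} = \cos\!\big(\phi_I(o_t),\, \phi_T(g^{+})\big)
                   - \cos\!\big(\phi_I(o_t),\, \phi_T(g^{-})\big),
\qquad
r_t = \lambda\, r_t^{\mathrm{sem}} + r_t^{\mathrm{state}}
```

where r_t^state aggregates vehicle-state information such as speed and lane-keeping, and λ balances the semantic and state-based terms.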

Introduction Video


Overview



Comparative Overview of Reward Design Paradigms for Autonomous Driving. (a) Fundamentals and limitations of IL/RL-based methods for driving policy learning. (b) Fundamentals and limitations of foundation model-based reward design methods (i.e., LLM-as-Reward and VLM-as-Reward paradigms) for driving policy learning. (c) Our proposed VLM-RL framework, which leverages VLMs to achieve comprehensive and stable reward design for safe autonomous driving.

Motivation



Conceptual comparisons of reward design paradigms. (a) Robotic manipulation tasks often feature well-defined goals (e.g., “Put carrot in bowl”), enabling VLMs to provide clear semantic rewards. (b) Existing methods that use only negative goals (e.g., “two cars have collided”) focus on avoidance but lack positive guidance. (c) Our CLG-as-Reward paradigm integrates both positive and negative goals, allowing VLM-RL to deliver comprehensive semantic guidance for safer, more generalizable driving.
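To make the CLG-as-Reward idea concrete, the minimal Python sketch below scores a single camera frame against a positive and a negative language goal with an off-the-shelf CLIP model (via Hugging Face transformers). The checkpoint, the prompt wording, and the simple "positive minus negative cosine similarity" form are illustrative assumptions, not necessarily the exact prompts or reward used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical contrasting language goals; the paper's exact prompts may differ.
POSITIVE_GOAL = "the road is clear and the ego vehicle is driving safely in its lane"
NEGATIVE_GOAL = "two cars have collided with each other on the road"

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def clg_semantic_reward(frame: Image.Image) -> float:
    """Return a contrasting-language-goal (CLG) reward for one camera frame.

    Sketch only: reward = cos(image, positive goal) - cos(image, negative goal),
    which is positive when the scene looks more like the positive goal.
    """
    inputs = processor(
        text=[POSITIVE_GOAL, NEGATIVE_GOAL],
        images=frame,
        return_tensors="pt",
        padding=True,
    )
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(0)  # [cos_positive, cos_negative]
    return float(sims[0] - sims[1])
```

In VLM-RL this raw similarity contrast is not used as the reward on its own; as described in the abstract, it is further combined with vehicle state information through hierarchical reward synthesis.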

Framework



Architecture of the VLM-RL Framework for Autonomous Driving. (a) Observation and action spaces for policy learning; (b) Definition of contrasting language goals (CLG) to provide semantic guidance; (c) CLG-based semantic reward computation using pre-trained VLMs; (d) Hierarchical reward synthesis that integrates semantic rewards with vehicle state information for comprehensive and stable reward signals; (e) Policy training with batch processing, where SAC updates are performed using experiences stored in a replay buffer and rewards are computed asynchronously to optimize efficiency.
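Component (e) decouples reward computation from environment stepping. The sketch below shows one way this could look: transitions are buffered without their semantic reward, and the VLM is queried in large batches before SAC updates. The buffer layout, the `reward_model` interface, and the weighted-sum synthesis are assumptions for illustration, not the paper's exact implementation.

```python
import collections
import torch

# Minimal sketch of the batch-processing idea in component (e).
Transition = collections.namedtuple(
    "Transition", ["obs", "action", "next_obs", "done", "frame", "state_reward"]
)

class DelayedRewardBuffer:
    """Stores transitions whose semantic reward has not been computed yet."""

    def __init__(self, semantic_weight: float = 1.0):
        self.pending = []        # transitions awaiting a VLM semantic reward
        self.ready = []          # (obs, action, reward, next_obs, done) tuples
        self.semantic_weight = semantic_weight

    def add(self, transition: Transition) -> None:
        self.pending.append(transition)

    @torch.no_grad()
    def flush(self, reward_model, batch_size: int = 64) -> None:
        """Query the VLM in batches and synthesize the final rewards."""
        while self.pending:
            chunk, self.pending = self.pending[:batch_size], self.pending[batch_size:]
            frames = torch.stack([t.frame for t in chunk])   # (B, C, H, W) image tensors
            sem = reward_model(frames)                       # (B,) CLG semantic rewards
            for t, r_sem in zip(chunk, sem.tolist()):
                # Hierarchical synthesis (assumed form): semantic + vehicle-state reward.
                reward = self.semantic_weight * r_sem + t.state_reward
                self.ready.append((t.obs, t.action, reward, t.next_obs, t.done))
```

Off-policy learners such as SAC tolerate this delay because a transition's reward only needs to exist by the time it is sampled from the replay buffer for an update.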

Experiment



Table 1 shows that VLM-RL outperforms both expert-designed and language-model-based reward methods, achieving a balance between safety and efficiency. VLM-RL reaches a higher average speed (17.4) and the best route completion (4.4) while maintaining a low collision rate (0.68) and collision speed (2.6). In comparison, expert-designed methods such as TIRL fail to make meaningful progress, and Chen-SAC sacrifices safety for speed. Language-model-based methods such as Revolve and LORD perform moderately but lack the stability and generalization of VLM-RL.

Visualization

(a) Comparison with State-of-the-Art

VLM-RL (Town 02)

ChatScene-SAC (Town 02)

Revolve (Town 02)


(b) Comparison in Dense Traffic Flow Environment

(i) VLM-RL

Routes 1–10 (Town 02 with dense traffic)

VLM-RL achieves the best performance in the dense traffic flow environment.

(ii) ChatScene-SAC

Routes 1–10 (Town 02 with dense traffic)

ChatScene-SAC achieves intermediate performance in the dense traffic flow environment.

(iii) Revolve

Routes 1–10 (Town 02 with dense traffic)

Revolve performs poorly in the dense traffic flow environment.


(c) Comparison across Different Towns

(i) Town 01

VLM-RL (Town 01)

ChatScene-SAC (Town 01)

Revolve (Town 01)

(ii) Town 03

VLM-RL (Town 03)

ChatScene-SAC (Town 03)

Revolve (Town 03)

(iii) Town 04

VLM-RL (Town 04)

ChatScene-SAC (Town 04)

Revolve (Town 04)

(iv) Town 05

VLM-RL (Town 05)

ChatScene-SAC (Town 05)

Revolve (Town 05)

BibTeX

@article{huang2024vlm,
  title={VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving},
  author={Huang, Zilin and Sheng, Zihao and Qu, Yansong and You, Junwei and Chen, Sikai},
  journal={arXiv preprint arXiv:2412.15544},
  year={2024}
}