
Sky-Drive: A VR-Enabled Human-in-the-Loop Multi-Agent Simulation Platform for Human-Centered AI and Transportation

1 University of Wisconsin-Madison, 2 Purdue University
Indicates Equal Contribution, *Corresponding Author

After almost 65 minutes (30K steps) of training, the AV agent learns how to drive safely and efficiently.

Abstract

In recent years, reinforcement learning (RL)-based methods for learning driving policies have gained increasing attention in the autonomous driving community and have achieved remarkable progress in various driving scenarios. However, traditional RL approaches rely on manually engineered rewards, which require extensive human effort and often lack generalizability. To address these limitations, we propose VLM-RL, a unified framework that integrates pre-trained Vision-Language Models (VLMs) with RL to generate reward signals from image observations and natural language goals. The core of VLM-RL is the contrasting language goal (CLG)-as-reward paradigm, which uses positive and negative language goals to generate semantic rewards. We further introduce a hierarchical reward synthesis approach that combines CLG-based semantic rewards with vehicle state information, improving reward stability and offering a more comprehensive reward signal. Additionally, a batch-processing technique is employed to optimize computational efficiency during training. Extensive experiments in the CARLA simulator demonstrate that VLM-RL outperforms state-of-the-art baselines, achieving a 10.5% reduction in collision rate, a 104.6% increase in route completion rate, and robust generalization to unseen driving scenarios. Furthermore, VLM-RL can be seamlessly integrated with almost any standard RL algorithm, potentially revolutionizing the existing RL paradigm that relies on manual reward engineering and enabling continuous performance improvements.
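To make the CLG-as-reward idea concrete, the sketch below scores an image observation against a positive and a negative language goal with a pretrained VLM and blends the resulting semantic reward with a simple vehicle-state term. This is a minimal illustration under our own assumptions, not the paper's implementation: the use of OpenAI's CLIP, the goal sentences, the weights, and the speed-tracking term are all placeholders.

```python
# Minimal sketch of the CLG-as-reward idea (illustrative only):
# contrast the observation's similarity to a positive vs. a negative language goal,
# then combine the semantic reward with a vehicle-state term.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical goal sentences; the paper's exact wording may differ.
POSITIVE_GOAL = "the ego vehicle is driving safely along the road"
NEGATIVE_GOAL = "the ego vehicle is colliding with another object"


def clg_semantic_reward(image) -> float:
    """Semantic reward: similarity to the positive goal minus similarity to the negative goal."""
    inputs = processor(text=[POSITIVE_GOAL, NEGATIVE_GOAL], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(0)   # [sim_to_positive, sim_to_negative]
    return (sims[0] - sims[1]).item()         # higher when the scene matches the positive goal


def hierarchical_reward(image, speed_mps, target_speed_mps=8.0,
                        w_semantic=1.0, w_speed=0.1) -> float:
    """Combine the CLG semantic reward with a vehicle-state term (speed tracking, as an example)."""
    speed_term = -abs(speed_mps - target_speed_mps) / target_speed_mps
    return w_semantic * clg_semantic_reward(image) + w_speed * speed_term
```

The snippet scores one observation at a time; the batch-processing technique mentioned above would instead query the VLM on batches of observations to amortize inference cost during training.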

Overview



Comparative Overview of Reward Design Paradigms for Autonomous Driving. (a) Fundamentals and limitations of IL/RL-based methods for driving policy learning. (b) Fundamentals and limitations of foundation model-based reward design methods (i.e., LLM-as-Reward and VLM-as-Reward paradigms) for driving policy learning. (c) Our proposed VLM-RL framework leverages VLMs to achieve a comprehensive and stable reward design for safe autonomous driving.

Visualization

(a) Comparison with State-of-the-art

VLM-RL (Town 02)

ChatScene-SAC (Town 02)

Revolve (Town 02)


(b) Comparison in Dense Traffic Flow Environment

(i) VLM-RL

Routes 1–10 (Town 02 with dense traffic)

VLM-RL achieves the best performance in the dense traffic flow environment.

(ii) ChatScene-SAC

Routes 1–10 (Town 02 with dense traffic)

ChatScene-SAC achieves intermediate performance in the dense traffic flow environment.

(iii) Revolve

Routes 1–10 (Town 02 with dense traffic)

Revolve performs poorly in the dense traffic flow environment.

BibTeX

@article{huang2024sky,
        title={Sky-Drive: A VR-Enabled Human-in-the-Loop Multi-Agent Simulation Platform for Human-Centered AI and Transportation},
        author={Huang, Zilin and Sheng, Zihao and Wan, Zhengyang and Qu, Yansong and Luo, Yuhao and Wang, Boyue and Li, Pei and Chen, Sikai},
        journal={arXiv preprint arXiv:2412.15544},
        year={2024}
      }
      

Thank you for your interest in our work! 🚗 ✨

© Copyright Zilin Huang · Last updated