July 24, 2025
Slide Overview
Material for the DL paper reading group (DL輪読会)
From Foresight to Forethought: VLM-In-The-Loop Policy Steering via Latent Alignment [DL Papers]
DEEP LEARNING JP (http://deeplearning.jp/)
Presenter: Jeremy Siburian, Matsuo-Iwasawa Lab M1
Paper Overview
Paper Title: From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment
Authors: Yilin Wu¹, Ran Tian², Gokul Swamy¹, Andrea Bajcsy¹ (¹Carnegie Mellon University, ²UC Berkeley)
Conference: Robotics: Science and Systems (RSS) 2025; Outstanding Paper Award at the ICLR 2025 World Model Workshop
Links
• arXiv: https://arxiv.org/abs/2502.01828
• Project Page: https://yilin-wu98.github.io/forewarn/
• Presentation: https://iclr.cc/virtual/2025/10000108
Disclaimer: All credits for images, figures, and tables belong to the original authors.
Background: Generative Robot Policies
• Generative robot policies can learn complex, multimodal behaviors from demonstrations.
(Example: Physical Intelligence. (2025). π0.5: a VLA with Open-World Generalization. https://www.physicalintelligence.company/download/pi05.pdf)
• However, at runtime, such policies can degrade, exhibiting complete task failures or behaviors misaligned with user intent.
Background: Policy Steering
How do we improve a base imitation-learning (IL) policy?
• Traditional approach → fine-tune with additional intervention data or recovery behaviors
• Alternative → runtime policy steering, which requires no additional data
Policy steering can be cast as a stochastic model-predictive control framework: predict the outcomes of each candidate action plan (Prediction), then verify how well those outcomes align with user intent (Verification). A minimal sketch of this loop follows.
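The sketch below illustrates the generic prediction-then-verification loop described above; `base_policy`, `predict_outcome`, and `score_alignment` are hypothetical stand-ins for the base IL policy, the outcome predictor, and the intent-alignment verifier, not names from the paper.

```python
# Minimal sketch of runtime policy steering as stochastic MPC.
# All three callables are hypothetical placeholders.

def steer(obs, user_intent, base_policy, predict_outcome, score_alignment, k=8):
    """Sample K candidate action plans, predict each plan's outcome,
    and return the plan whose predicted outcome best matches intent."""
    candidates = [base_policy.sample_plan(obs) for _ in range(k)]
    outcomes = [predict_outcome(obs, plan) for plan in candidates]   # Prediction
    scores = [score_alignment(out, user_intent) for out in outcomes]  # Verification
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best]
```

Note that no additional demonstration data is required: the loop only re-ranks plans the base policy already proposes.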
Approach: FOREWARN (Filtering Options via REpresenting World-model Action Rollouts via Narration)
Key Idea: Use a world model to predict the outcomes of low-level actions, and a latent-aligned VLM as an open-world reward function for evaluating those outcomes.
(A) Predicting Outcomes via Latent World Models
• DreamerV3 world model
• Pretrained on an offline dataset containing both successful and failed rollouts from the base policy
• After training, the world model is frozen; only its trained observation encoder and latent dynamics model are used (see the sketch below)
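A minimal sketch of the latent rollout, assuming `encoder` and `dynamics` are handles to the frozen DreamerV3 observation encoder and latent dynamics model (the exact interfaces are assumptions, not the paper's API):

```python
import torch

@torch.no_grad()  # world model is frozen after pretraining
def rollout_latents(encoder, dynamics, obs, action_plan):
    """Encode the current observation once, then unroll the latent
    dynamics over a candidate action plan. No pixel decoding is needed:
    outcomes are predicted entirely in latent space."""
    z = encoder(obs)             # initial latent state
    latents = [z]
    for action in action_plan:
        z = dynamics(z, action)  # predict next latent from (state, action)
        latents.append(z)
    return torch.stack(latents)  # predicted latent trajectory
```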
(B) Latent-Text Alignment for Outcome Reasoning & Policy Steering
• Enable the VLM to reason directly about predicted latent states
• Replace the image tokenizer with the world model's encoder + dynamics model, and add a linear layer to align the latent and text embeddings (see the sketch below)
• Finetune the VLM on a VQA dataset for fine-grained behavior narration
• The VLM selects the best action plan based on the behavior narrations
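A sketch of the alignment layer, assuming the predicted latent trajectory from part (A) is projected into the VLM's token-embedding space in place of image tokens; the module name and dimensions are illustrative assumptions:

```python
import torch.nn as nn

class LatentToTextAdapter(nn.Module):
    """Hypothetical sketch: a linear layer that maps frozen world-model
    latents into the VLM's token-embedding space, standing in for the
    image tokenizer the VLM would normally use."""
    def __init__(self, latent_dim: int, llm_embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(latent_dim, llm_embed_dim)  # alignment layer

    def forward(self, latent_traj):    # (T, latent_dim) predicted rollout
        return self.proj(latent_traj)  # (T, llm_embed_dim) "latent tokens"
```

During finetuning, these latent tokens would be fed to the VLM alongside the text prompt, so the model learns to narrate behaviors directly from predicted latents rather than from rendered images.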
Experiments
Base Policy
• Diffusion Policy (DP)
• Trained on 100 teleoperated demonstrations per task
Results: Behavior Narration Performance
Metrics (a sketch of both follows the list)
• GT Accuracy: a binary score (0 or 1) indicating whether a predicted narration matches the ground-truth narration
• LLM Score: a similarity score (ranging from 0 to 1) assigned by GPT-4o
Baselines
• FOREWARN: the proposed method
• FOREWARN-Oracle: assumes access to ground-truth future observations (instead of relying on the latent dynamics to predict future outcomes)
• VLM-Act: similar to FOREWARN, but directly finetunes the VLM to generate narrations from the current observation and action plan, without an explicit world model
• VLM-Img: uses GPT-4o to generate behavior narrations zero-shot from the visual observations predicted by the world model
• VLM-Img-Oracle: an upper bound on the performance of VLM-Img, assuming access to ground-truth visual observations
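For concreteness, a minimal sketch of how the two metrics aggregate per-example judgments; `exact_match` and `gpt4o_judge` are hypothetical callables, not the paper's code:

```python
def gt_accuracy(preds, gts, exact_match):
    """GT Accuracy: fraction of narrations judged to match ground truth.
    `exact_match` is a hypothetical binary judge returning 0 or 1."""
    return sum(exact_match(p, g) for p, g in zip(preds, gts)) / len(gts)

def llm_score(preds, gts, gpt4o_judge):
    """LLM Score: mean similarity in [0, 1]; `gpt4o_judge` is a
    hypothetical wrapper around a GPT-4o similarity prompt."""
    return sum(gpt4o_judge(p, g) for p, g in zip(preds, gts)) / len(gts)
```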
Results: Policy Steering Performance
Baselines
• FOREWARN: the proposed method (world-model prediction + VLM verification)
• VLM-Act: similar to FOREWARN, but directly finetunes the VLM to generate narrations from the current observation and action plan, without an explicit world model
• VLM-DynLat-Category: world model for prediction + VLM for action selection, but without behavior narration
• Classifier-Dyn-Latent: directly takes the predicted latent embeddings and trains a transformer-based binary classifier instead of a VLM (sketched below)
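A sketch of what the Classifier-Dyn-Latent baseline could look like; layer sizes and pooling are illustrative assumptions, not the paper's architecture:

```python
import torch.nn as nn

class LatentOutcomeClassifier(nn.Module):
    """Hypothetical Classifier-Dyn-Latent sketch: a transformer encoder
    over the predicted latent trajectory, followed by a binary head."""
    def __init__(self, latent_dim: int, d_model: int = 256,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)  # logit: aligned vs. misaligned

    def forward(self, latents):  # (B, T, latent_dim) predicted rollout
        h = self.encoder(self.embed(latents))
        return self.head(h.mean(dim=1))  # pool over time, binary logit
```

Unlike FOREWARN, this baseline produces only a binary judgment, with no interpretable narration of the predicted behavior.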
Results: Policy Steering Performance
Takeaway 1: FOREWARN can effectively steer the policy toward safe and aligned behavior modes by leveraging the VLM as an interpreter and evaluator of predicted latent outcomes.
Results: Policy Steering Performance
Takeaway 2: Without explicitly training the VLM to interpret predicted action outcomes from the latent space, its steering performance degrades severely under novel task scenarios.
Results: Policy Steering Performance
Takeaway 3: Explicitly decoupling policy steering into the world model's prediction and the VLM's verification enables more effective policy steering.
Results: Qualitative Examples
Summary
• A novel formulation of policy steering as a stochastic model-predictive control problem.
• A latent-space alignment strategy between a world model and a VLM, enabling reliable verification of predicted outcomes in a shared representation space.