July 24, 2025
Slide Overview
Material for the DL paper reading group (DL輪読会)
From Foresight to Forethought: VLM-In-The-Loop Policy Steering via Latent Alignment [DL Papers]
DEEP LEARNING JP (http://deeplearning.jp/)
Presenter: Jeremy Siburian, Matsuo-Iwasawa Lab M1
Paper Overview
Paper Title: From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment
Authors: Yilin Wu¹, Ran Tian², Gokul Swamy¹, Andrea Bajcsy¹ (¹Carnegie Mellon University, ²UC Berkeley)
Conference: Robotics: Science and Systems (RSS) 2025; Outstanding Paper Award at the ICLR 2025 World Model Workshop
Links
• arXiv: https://arxiv.org/abs/2502.01828
• Project Page: https://yilin-wu98.github.io/forewarn/
• Presentation: https://iclr.cc/virtual/2025/10000108
Disclaimer: All credits for images, figures, and tables belong to the original authors.
Background: Generative Robot Policies
• Generative robot policies can learn complex, multimodal behaviors from demonstrations.
(Example: Physical Intelligence. (2025). π0.5: a VLA with Open-World Generalization. https://www.physicalintelligence.company/download/pi05.pdf)
• However, at runtime, such policies can degrade, exhibiting complete task failures or behaviors misaligned with user intent.
Background: Policy Steering
How do we improve a base imitation-learning (IL) policy?
• Traditional approach → fine-tune with additional intervention data or recovery behaviors
• Alternative → runtime policy steering, which requires no additional data
Policy steering can be cast as a stochastic model-predictive control framework: predict the outcomes of each candidate action plan (Prediction), then verify how well those outcomes align with user intent (Verification). A minimal sketch of this loop follows.
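The sketch below illustrates the generic prediction-then-verification loop described above; `base_policy`, `predict_outcome`, and `score_alignment` are hypothetical stand-ins for the base IL policy, the outcome predictor, and the intent-alignment verifier, not names from the paper.

```python
# Minimal sketch of runtime policy steering as stochastic MPC.
# All three callables are hypothetical placeholders.

def steer(obs, user_intent, base_policy, predict_outcome, score_alignment, k=8):
    """Sample K candidate action plans, predict each plan's outcome,
    and return the plan whose predicted outcome best matches intent."""
    candidates = [base_policy.sample_plan(obs) for _ in range(k)]
    outcomes = [predict_outcome(obs, plan) for plan in candidates]   # Prediction
    scores = [score_alignment(out, user_intent) for out in outcomes]  # Verification
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best]
```

Note that no additional demonstration data is required: the loop only re-ranks plans the base policy already proposes.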
Approach: FOREWARN (Filtering Options via REpresenting World-model Action Rollouts via Narration)
Key Idea: Use a world model to predict the outcomes of low-level actions, and a latent-aligned VLM as an open-world reward function for evaluating those outcomes.
(A) Predicting Outcomes via Latent World Models
• DreamerV3 world model
• Pretrained on an offline dataset containing both successful and failed rollouts from the base policy
• After training, the world model is frozen; only its trained observation encoder and latent dynamics model are used (see the sketch below)
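A minimal sketch of the latent rollout, assuming `encoder` and `dynamics` are handles to the frozen DreamerV3 observation encoder and latent dynamics model (the exact interfaces are assumptions, not the paper's API):

```python
import torch

@torch.no_grad()  # world model is frozen after pretraining
def rollout_latents(encoder, dynamics, obs, action_plan):
    """Encode the current observation once, then unroll the latent
    dynamics over a candidate action plan. No pixel decoding is needed:
    outcomes are predicted entirely in latent space."""
    z = encoder(obs)             # initial latent state
    latents = [z]
    for action in action_plan:
        z = dynamics(z, action)  # predict next latent from (state, action)
        latents.append(z)
    return torch.stack(latents)  # predicted latent trajectory
```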
(B) Latent-Text Alignment for Outcome Reasoning & Policy Steering
• Enable the VLM to reason directly about predicted latent states
• Replace the image tokenizer with the world model's encoder + dynamics model, and add a linear layer to align the latent and text embeddings (see the sketch below)
• Finetune the VLM on a VQA dataset for fine-grained behavior narration
• The VLM selects the best action plan based on the behavior narrations
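A sketch of the alignment layer, assuming the predicted latent trajectory from part (A) is projected into the VLM's token-embedding space in place of image tokens; the module name and dimensions are illustrative assumptions:

```python
import torch.nn as nn

class LatentToTextAdapter(nn.Module):
    """Hypothetical sketch: a linear layer that maps frozen world-model
    latents into the VLM's token-embedding space, standing in for the
    image tokenizer the VLM would normally use."""
    def __init__(self, latent_dim: int, llm_embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(latent_dim, llm_embed_dim)  # alignment layer

    def forward(self, latent_traj):    # (T, latent_dim) predicted rollout
        return self.proj(latent_traj)  # (T, llm_embed_dim) "latent tokens"
```

During finetuning, these latent tokens would be fed to the VLM alongside the text prompt, so the model learns to narrate behaviors directly from predicted latents rather than from rendered images.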
Experiments
Base Policy
• Diffusion Policy (DP)
• Trained on 100 teleoperated demonstrations per task
Results: Behavior Narration Performance
Metrics (a sketch of both follows the list)
• GT Accuracy: a binary score (0 or 1) indicating whether a predicted narration matches the ground-truth narration
• LLM Score: a similarity score (ranging from 0 to 1) assigned by GPT-4o
Baselines
• FOREWARN: the proposed method
• FOREWARN-Oracle: assumes access to ground-truth future observations (instead of relying on the latent dynamics to predict future outcomes)
• VLM-Act: similar to FOREWARN, but directly finetunes the VLM to generate narrations from the current observation and action plan, without an explicit world model
• VLM-Img: uses GPT-4o to generate behavior narrations zero-shot from the visual observations predicted by the world model
• VLM-Img-Oracle: an upper bound on the performance of VLM-Img, assuming access to ground-truth visual observations
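For concreteness, a minimal sketch of how the two metrics aggregate per-example judgments; `exact_match` and `gpt4o_judge` are hypothetical callables, not the paper's code:

```python
def gt_accuracy(preds, gts, exact_match):
    """GT Accuracy: fraction of narrations judged to match ground truth.
    `exact_match` is a hypothetical binary judge returning 0 or 1."""
    return sum(exact_match(p, g) for p, g in zip(preds, gts)) / len(gts)

def llm_score(preds, gts, gpt4o_judge):
    """LLM Score: mean similarity in [0, 1]; `gpt4o_judge` is a
    hypothetical wrapper around a GPT-4o similarity prompt."""
    return sum(gpt4o_judge(p, g) for p, g in zip(preds, gts)) / len(gts)
```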
Results: Policy Steering Performance
Baselines
• FOREWARN: the proposed method (world-model prediction + VLM verification)
• VLM-Act: similar to FOREWARN, but directly finetunes the VLM to generate narrations from the current observation and action plan, without an explicit world model
• VLM-DynLat-Category: world model for prediction + VLM for action selection, but without behavior narration
• Classifier-Dyn-Latent: directly takes the predicted latent embeddings and trains a transformer-based binary classifier instead of a VLM (sketched below)
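A sketch of what the Classifier-Dyn-Latent baseline could look like; layer sizes and pooling are illustrative assumptions, not the paper's architecture:

```python
import torch.nn as nn

class LatentOutcomeClassifier(nn.Module):
    """Hypothetical Classifier-Dyn-Latent sketch: a transformer encoder
    over the predicted latent trajectory, followed by a binary head."""
    def __init__(self, latent_dim: int, d_model: int = 256,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)  # logit: aligned vs. misaligned

    def forward(self, latents):  # (B, T, latent_dim) predicted rollout
        h = self.encoder(self.embed(latents))
        return self.head(h.mean(dim=1))  # pool over time, binary logit
```

Unlike FOREWARN, this baseline produces only a binary judgment, with no interpretable narration of the predicted behavior.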
Results: Policy Steering Performance
Takeaway 1: FOREWARN can effectively steer the policy toward safe and aligned behavior modes by leveraging the VLM as an interpreter and evaluator of predicted latent outcomes.
Results: Policy Steering Performance
Takeaway 2: Without explicitly training the VLM to interpret predicted action outcomes from the latent space, its steering performance degrades severely under novel task scenarios.
Results: Policy Steering Performance
Takeaway 3: Explicitly decoupling policy steering into the world model's prediction and the VLM's verification enables more effective policy steering.
Results: Qualitative Examples
Summary
• A novel formulation of policy steering as a stochastic model-predictive control problem.
• A latent-space alignment strategy between a world model and a VLM, enabling reliable verification of predicted outcomes in a shared representation space.