Recent advances in hierarchical robot systems leverage a high-level planner to propose task plans and a low-level policy to generate robot actions. This design allows training the planner on action-free or even non-robot data sources (e.g., videos), providing transferable high-level guidance. Nevertheless, grounding these high-level plans into executable actions remains challenging, especially given the limited availability of high-quality robot data. To this end, we propose to improve the low-level policy through online interactions. Specifically, our approach collects online rollouts, retrospectively annotates the corresponding high-level goals from the achieved outcomes, and aggregates these hindsight-relabeled experiences to update a goal-conditioned imitation policy. Our method, Hindsight Flow-conditioned Online Imitation (HinFlow), instantiates this idea with 2D point flows as the high-level planner. Across diverse manipulation tasks in both simulation and the physical world, our method achieves more than a 2× performance improvement over the base policy, significantly outperforming existing methods. Moreover, our framework enables policy acquisition from planners trained on cross-embodiment video data, demonstrating its potential for scalable and transferable robot learning.

Point flow is a general representation for such high-level plans, describing predicted future states as keypoint trajectories in image space. However, translating these flow plans into robust and scalable low-level policies remains a critical challenge.
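As a concrete illustration (not the paper's exact data format), a 2D point-flow plan can be stored as an array of shape (T, K, 2): T future timesteps, K tracked keypoints, and an (x, y) pixel coordinate per keypoint. The sketch below builds such a plan by accumulating per-step displacements; the image size and displacement scale are arbitrary choices for the example.

```python
import numpy as np

# Hypothetical illustration of a 2D point-flow plan:
# K keypoints tracked over T future timesteps, each an (x, y) pixel coordinate.
T, K = 8, 16
rng = np.random.default_rng(0)

# Initial keypoint locations in a 128x128 image.
points = rng.uniform(0, 128, size=(K, 2))

# A flow plan: per-step displacements accumulated into keypoint trajectories.
displacements = rng.normal(0.0, 1.5, size=(T, K, 2))
flow_plan = points + np.cumsum(displacements, axis=0)  # shape (T, K, 2)

assert flow_plan.shape == (T, K, 2)
```

A planner trained on videos predicts such trajectories directly from an image, and the low-level policy conditions on them as its goal.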
HinFlow learns this translation from the robot's interactions with the environment. The core insight is that even if the collected experiences fail to reach the planner's precise goal, they can be repurposed for self-imitation by framing the achieved flows as the intended goal. Generating such supervision directly from the robot's own imperfect experiences allows it to learn and adapt without relying on a large volume of expert data.
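The relabeling step above can be sketched in a few lines. This is a minimal toy, not HinFlow's actual interface: `hindsight_relabel` and the toy rollout structure are hypothetical, and real usage would obtain the achieved flow from a point tracker run on the rollout's frames.

```python
import numpy as np

def hindsight_relabel(rollout_transitions, achieved_flow):
    """Relabel a rollout's transitions with the flow the robot actually
    achieved, so even a failed attempt yields valid goal-conditioned
    imitation data. Transitions are (obs, action) pairs; the achieved
    flow becomes the conditioning goal for every step."""
    return [(obs, achieved_flow, action) for obs, action in rollout_transitions]

# Toy rollout: 3 steps with scalar observations/actions, plus the point
# flow (shape (T, K, 2)) that a tracker would extract from the frames.
rollout = [(np.array([0.0]), np.array([0.1])),
           (np.array([0.1]), np.array([0.2])),
           (np.array([0.3]), np.array([0.0]))]
achieved = np.zeros((3, 4, 2))  # T=3 steps, K=4 keypoints, (x, y)

relabeled = hindsight_relabel(rollout, achieved)
# Every transition is now supervised with the achieved flow as its goal.
assert all(goal is achieved for _, goal, _ in relabeled)
```

Aggregating such relabeled tuples into a buffer and running goal-conditioned behavior cloning on them yields the self-imitation update described above.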
Simulation benchmarks: LIBERO and ManiSkill.
A high-level planner is trained on a large action-free dataset from a source arm together with only five labeled demonstrations from a target arm. Then HinFlow learns a control policy for the target arm under the planner's guidance.
Cross-embodiment data noticeably improves the planner (e.g., giving explicit guidance for lifting the book, or directing the gripper to the correct grasping location). HinFlow's online self-improvement effectively grounds those planner gains into a more robust policy.
Kinova: planner trained on ~300 action-free Franka videos + 5 action-labeled Kinova demos.
After online interactions on Kinova: 48.1% w/ cross-embodiment data vs. 0.6% w/o.

xArm: planner trained on ~300 action-free Franka videos + 5 action-labeled xArm demos.
After online interactions on xArm: 61.3% w/ cross-embodiment data vs. 24.4% w/o.
On a real-world mouse pick-and-place task, HinFlow improves the success rate from 8/20 to 19/20 using only 86 online interaction trajectories. The mouse is randomly reset within a 15 cm × 15 cm area.
@inproceedings{zheng2026translating,
title={Translating Flow to Policy via Hindsight Online Imitation},
author={Zheng, Yitian and Ye, Zhangchen and Dong, Weijun and Wang, Shengjie and Liu, Yuyang and Zhang, Chongjie and Wen, Chuan and Gao, Yang},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}