We introduce Reversal Q-Learning (RQL), a clean & performant method for flow-based reinforcement learning (RL) from prior data.
We treat individual flow refinement steps as RL actions, but naively doing so lengthens the value learning horizon, making it unsuitable for off-policy RL.
We find a solution which prevents this expansion via "flow reversal".
We find RQL obtains the best offline RL performance over 50 tasks compared to 19 state-of-the-art flow RL algorithms.
RQL learns a value function over both complete and partially-generated actions, finetuning intermediate flow steps via the action gradient.
It directly trains the full, expressive flow policy, without unstable backpropagation through time or one-step flow distillation.
A flow policy constructs actions via a sequence of refinement steps. To do RL, we can treat individual refinement steps as actions and apply standard RL algorithms.
However, this often does not work well empirically: doing RL directly over refinement steps expands each action into multiple decision steps, multiplying the value learning horizon.
This expansion is particularly bad for off-policy RL, which exhibits the "curse of horizon". It is also unclear how to utilize prior data, since standard datasets lack sequences of refinement steps.
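To make the horizon expansion concrete, here is a minimal sketch. It assumes a 1-D action, Euler integration, and a hypothetical toy velocity field `toy_velocity`; none of these come from the paper. It shows how a single action becomes `num_flow_steps` refinement steps, so an episode of `env_horizon` actions turns into `env_horizon * num_flow_steps` decision steps once every refinement step is treated as an RL action.

```python
def toy_velocity(state, x, t):
    """Hypothetical velocity field (an assumption): pulls x toward the state."""
    return state - x


def refine(state, noise, num_flow_steps, velocity=toy_velocity):
    """Euler-integrate the flow ODE from t=0 (noise) to t=1 (action).

    Returns every point along the way: the initial noise, each partially
    generated action, and the completed action.
    """
    dt = 1.0 / num_flow_steps
    x = noise
    steps = [x]
    for k in range(num_flow_steps):
        t = k * dt
        x = x + dt * velocity(state, x, t)
        steps.append(x)
    return steps


def expanded_horizon(env_horizon, num_flow_steps):
    """Decision steps per episode when each refinement step counts as an action."""
    return env_horizon * num_flow_steps


steps = refine(state=0.2, noise=1.0, num_flow_steps=10)
action = steps[-1]  # the completed action is the last refinement step
```

With 10 flow steps, an environment horizon of 1000 would become 10000 decision steps under this accounting, which is the expansion the text describes.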
We recognize we can prevent an expansion in the value learning horizon by constructing "virtual" flow trajectories from standard prior data that are perfectly suited for multi-step returns.
We generate trajectories in the expanded framework via flow reversal, which follows the flow ODE in reverse from actions in prior data.
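The idea of flow reversal can be sketched as follows. This is an illustrative sketch only: the 1-D action, the Euler discretization, and the toy velocity field `toy_velocity` are assumptions, and the function names `reverse_flow` and `forward_flow` are hypothetical rather than taken from the paper's code.

```python
def toy_velocity(state, x, t):
    """Hypothetical velocity field (an assumption): pulls x toward the state."""
    return state - x


def reverse_flow(state, action, num_flow_steps, velocity=toy_velocity):
    """Follow the flow ODE backward from t=1 (a dataset action) to t=0.

    Returns the "virtual" trajectory ordered from t=0 to t=1, so the last
    element is the original dataset action.
    """
    dt = 1.0 / num_flow_steps
    x = action
    steps = [x]
    for k in range(num_flow_steps, 0, -1):
        t = k * dt
        x = x - dt * velocity(state, x, t)  # step backward in flow time
        steps.append(x)
    steps.reverse()
    return steps


def forward_flow(state, x0, num_flow_steps, velocity=toy_velocity):
    """Euler-integrate the same ODE forward from t=0 to t=1."""
    dt = 1.0 / num_flow_steps
    x = x0
    for k in range(num_flow_steps):
        x = x + dt * velocity(state, x, k * dt)
    return x


trajectory = reverse_flow(state=0.2, action=0.5, num_flow_steps=100)
recovered = forward_flow(0.2, trajectory[0], 100)
```

The reversal involves no sampling, so the same state and action always yield the same trajectory. In this sketch, running the flow forward from `trajectory[0]` returns approximately to the dataset action; Euler reversal is only an approximate inverse of Euler forward integration, and the mismatch shrinks as `num_flow_steps` grows.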
We show these trajectories are deterministic and on-policy, and they thereby allow for unbiased, zero-variance multi-step returns.
This enables robust off-policy flow RL in the expanded framework.
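For intuition about multi-step returns along a deterministic trajectory, here is a sketch of the textbook n-step return. The way rewards and discounts are placed on flow steps below (zero reward and no discounting on intermediate steps, with the environment reward and discount applied on the final step) is an assumption made for illustration, not the paper's stated target.

```python
def n_step_return(rewards, discounts, bootstrap_value):
    """Textbook multi-step return, computed backward from the bootstrap value.

    rewards[i] and discounts[i] belong to the i-th step of the trajectory;
    bootstrap_value is the value estimate after the last step.
    """
    ret = bootstrap_value
    for reward, discount in zip(reversed(rewards), reversed(discounts)):
        ret = reward + discount * ret
    return ret


# Assumed layout for a 4-step virtual flow trajectory (illustrative only):
# intermediate flow steps carry no reward and no discount, and the final
# step receives the environment reward and the environment discount.
rewards = [0.0, 0.0, 0.0, 1.0]
discounts = [1.0, 1.0, 1.0, 0.99]
target = n_step_return(rewards, discounts, bootstrap_value=10.0)
```

Because every step of such a trajectory is a deterministic function of the state and the dataset action, the return contains no sampled quantities along the flow steps, which is the intuition behind the zero-variance claim.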
We learn a value function jointly over complete actions and intermediate flow steps
where $x^{f}$ is computed with
We can then use reparameterized gradients on each flow step (alongside a BC term).
We are able to leverage the rich first-order gradient information of our value function to efficiently finetune individual flow steps of a flow policy.
That's it!
Existing off-policy methods circumvent the challenge of RL over the expanded framework by treating a flow policy as a black-box policy class, learning a value function over fully-generated actions and finetuning intermediate flow steps commonly via:
We show that RQL leads to substantially better performance than these prior methods.
If you find this work useful, please cite:
@article{rql_oberai2026,
title = {Reversal Q-Learning},
author = {Aditya Oberai and Seohong Park and Sergey Levine},
conference = {arXiv Pre-print},
year = {2026},
url = {},
}