Reversal Q-Learning

University of California, Berkeley

TL;DR

We introduce Reversal Q-Learning (RQL), a clean & performant method for flow-based reinforcement learning (RL) from prior data.

We treat individual flow refinement steps as RL actions, but naively doing so lengthens the value learning horizon, making it unsuitable for off-policy RL.

We find a solution which prevents this expansion via "flow reversal".

Results

We find RQL obtains the best offline RL performance over 50 tasks compared to 19 state-of-the-art flow RL algorithms.

Bar chart summary of success rates across methods and task groups: antmaze-giant, humanoid-large, scene, puzzle-4x4, cube-quadruple.

Overview

RQL learns a value function over both complete and partially-generated actions, finetuning intermediate flow steps via the action gradient.

It directly trains the full, expressive flow policy, without unstable backpropagation through time or one-step flow distillation.

Background

A flow policy constructs actions via a sequence of refinement steps. To do RL, we can treat individual refinement steps as actions and apply standard RL algorithms.
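As a minimal sketch of the sampling loop described above, the following starts from Gaussian noise and applies a velocity field for a fixed number of refinement steps. The names `toy_velocity` and `sample_action`, the discrete update $x^{f+1} = x^f + v(s, x^f, f)$, and the toy field itself are assumptions made for illustration, not the authors' implementation.

```python
import random

def toy_velocity(state, x, f, num_steps):
    """Hypothetical velocity field: steps x toward a state-dependent target.

    The step size is folded into the returned velocity, so one refinement
    step is simply x + v.
    """
    target = [2.0 * s for s in state]
    remaining = num_steps - f  # refinement steps left, including this one
    return [(t - xi) / remaining for t, xi in zip(target, x)]

def sample_action(state, velocity, num_steps, rng):
    """Return the full refinement trajectory [x^0, ..., x^F] for one state."""
    x = [rng.gauss(0.0, 1.0) for _ in state]  # x^0: Gaussian noise
    trajectory = [x]
    for f in range(num_steps):
        v = velocity(state, x, f, num_steps)
        x = [xi + vi for xi, vi in zip(x, v)]  # one refinement step
        trajectory.append(x)
    return trajectory

rng = random.Random(0)
trajectory = sample_action([0.5, -1.0], toy_velocity, num_steps=4, rng=rng)
action = trajectory[-1]  # x^F, the action sent to the environment
```

Treating refinement steps as RL actions means every transition between consecutive entries of `trajectory` becomes its own decision, with only the last entry reaching the environment.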

Challenge

However, this often does not work well empirically. We can directly do RL over refinement steps, but this expands each action into multiple decision steps, multiplying the value learning horizon.
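As a rough illustration with assumed numbers (not taken from the paper): if an episode has $H$ environment steps and the flow policy uses $F$ refinement steps per action, the expanded decision process has

$$H_{\mathrm{expanded}} = H \cdot F, \qquad \text{e.g. } H = 200,\ F = 10 \ \Rightarrow\ H_{\mathrm{expanded}} = 2000,$$

so one-step bootstrapped value targets would have to propagate reward information across ten times as many backups.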

This expansion is particularly bad for off-policy RL, which exhibits the "curse of horizon". It is also unclear how to utilize prior data, since standard datasets lack sequences of refinement steps.

Solution

We recognize we can prevent an expansion in the value learning horizon by constructing "virtual" flow trajectories from standard prior data that are perfectly suited for multi-step returns.

Flow reversal schematic: intermediate trajectory segments from offline actions.

We generate trajectories in the expanded framework via flow reversal, which follows the flow ODE in reverse from actions in prior data.

We show these trajectories are deterministic and on-policy, and they thereby allow for unbiased, zero-variance multi-step returns.

This enables robust off-policy flow RL in the expanded framework.

Implementation

We learn a value function jointly over complete actions and intermediate flow steps:

$$\mathcal{L}(V) = \mathbb{E}_{\widetilde{\tau}}\big[\ell_2^\kappa\big(V(s, x^f, f) - (r + \gamma V(s', x'^0, 0))\big)\big]$$
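The value objective above can be sketched in a few lines. This is only an illustration under stated assumptions: the page does not define $\ell_2^\kappa$, so the sketch takes the per-sample loss as a parameter and uses plain squared error as a stand-in; `value_loss`, `toy_value`, and the scalar states and actions are hypothetical.

```python
def value_loss(batch, value_fn, gamma, per_sample_loss):
    """Mean loss between V(s, x^f, f) and the target r + gamma * V(s', x'^0, 0)."""
    total = 0.0
    for s, x_f, f, r, s_next, x0_next in batch:
        target = r + gamma * value_fn(s_next, x0_next, 0)
        total += per_sample_loss(value_fn(s, x_f, f) - target)
    return total / len(batch)

def toy_value(s, x, f):
    """Hypothetical value function, linear in a scalar state, action, and flow step."""
    return s + x + 0.1 * f

def squared_error(u):
    """Stand-in for the page's per-sample loss, which is not defined there."""
    return u * u

# One transition (s, x^f, f, r, s', x'^0); all values are made up for illustration.
batch = [(1.0, 0.5, 2, 1.0, 2.0, -1.0)]
loss = value_loss(batch, toy_value, gamma=0.9, per_sample_loss=squared_error)
```

Note that the target bootstraps from the next state's initial flow step (index 0), so a partially generated action at any step $f$ is regressed toward the same environment-level return as the complete action.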

where $x^{f}$ is computed with:

$$\underbrace{x^f}_{\mathrm{partially\ generated\ action}} = \underbrace{x^F}_{\mathrm{dataset\ action}} - \int_f^F v(s, x^t, t)\,dt$$
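A minimal numerical sketch of this reversal, assuming a scalar action, a continuous flow time, and simple Euler steps taken backward in time: the function `reverse_flow`, the linear `toy_velocity` field, and the constant `RATE` are hypothetical and chosen only because the resulting ODE has a closed-form solution to compare against.

```python
import math

RATE = 1.0  # hypothetical contraction rate of the toy field

def toy_velocity(state, x, t):
    """Hypothetical velocity field pulling a scalar x toward the target 2 * state."""
    return RATE * (2.0 * state - x)

def reverse_flow(state, action, f, F, velocity, substeps=1000):
    """Approximate x^f by following the flow ODE from time F back to time f."""
    dt = (F - f) / substeps
    x, t = action, F
    for _ in range(substeps):
        x = x - dt * velocity(state, x, t)  # Euler step taken backward in time
        t -= dt
    return x

state, dataset_action = 0.5, 0.8
x_partial = reverse_flow(state, dataset_action, f=0.25, F=1.0, velocity=toy_velocity)

# For this linear field the reversed point has a closed form to compare against.
target = 2.0 * state
x_exact = target + (dataset_action - target) * math.exp(RATE * (1.0 - 0.25))
```

Because the reversal involves no sampling, repeating it for the same state and dataset action yields the same partially generated action, which is the determinism the multi-step returns rely on.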

We can then update each flow step with reparameterized gradients, alongside a behavioral cloning (BC) term:

$$\mathcal{L}(v) = \underbrace{-\mathbb{E}_{\widetilde{\tau}}[V(s, x^f + v(s, x^f, f), f+1)]}_{\mathrm{value\ maximization}} \underbrace{+\alpha\mathcal{L}^\mathrm{BC}(v)}_{\mathrm{behavioral\ regularization}}$$
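To make the role of the action gradient concrete, here is a toy sketch of one flow step being finetuned by gradient descent on this objective. Everything in it is an assumption for illustration: the velocity at the step is a single scalar parameter `theta`, the value function is a quadratic that ignores the state and step index, and the BC term is replaced by a squared error to a hypothetical reference velocity `v_bc`.

```python
def value(x, x_best=1.0):
    """Hypothetical value over scalar (partially generated) actions, peaked at x_best."""
    return -(x - x_best) ** 2

def value_grad(x, x_best=1.0):
    """Action gradient dV/dx of the toy value function."""
    return -2.0 * (x - x_best)

def policy_loss(theta, x_f, v_bc, alpha):
    """-V(x^f + theta) + alpha * (theta - v_bc)^2, with theta standing in for v(s, x^f, f)."""
    return -value(x_f + theta) + alpha * (theta - v_bc) ** 2

def policy_loss_grad(theta, x_f, v_bc, alpha):
    """Gradient of policy_loss in theta; the value term uses the chain rule with dx/dtheta = 1."""
    return -value_grad(x_f + theta) + 2.0 * alpha * (theta - v_bc)

x_f, v_bc, alpha = 0.2, 0.3, 1.0  # made-up numbers
theta = 0.0
initial_loss = policy_loss(theta, x_f, v_bc, alpha)
for _ in range(200):
    theta -= 0.1 * policy_loss_grad(theta, x_f, v_bc, alpha)  # plain gradient descent
final_loss = policy_loss(theta, x_f, v_bc, alpha)

# Closed-form minimizer of this quadratic toy objective, for comparison.
theta_star = ((1.0 - x_f) + alpha * v_bc) / (1.0 + alpha)
```

In this toy setting the learned step lands between the value-maximizing velocity and the reference velocity, with `alpha` controlling how strongly it is pulled toward the latter.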

We are able to leverage the rich first-order gradient information of our value function to efficiently finetune individual flow steps of a flow policy.

That's it!

Comparison

Existing off-policy methods circumvent the challenge of RL over the expanded framework by treating a flow policy as a black-box policy class, learning a value function over fully-generated actions and finetuning intermediate flow steps commonly via:

  1. backpropagation through time, which is unstable;
  2. one-step flow distillation, which inhibits expressivity;
  3. weighted regression, which empirically underperforms.

We show that RQL leads to substantially better performance than these prior approaches.

Citation

If you find this work useful, please cite:

@article{rql_oberai2026,
  title      = {Reversal Q-Learning},
  author     = {Aditya Oberai and Seohong Park and Sergey Levine},
  journal    = {arXiv Pre-print},
  year       = {2026},
  url        = {},
}