RoboMME-Interference
Benchmarking Robot Memory Under Interference
A cross-session benchmark for memory-augmented VLAs
Independent Researcher
Built on RoboMME (Dai et al., ICML 2026). We contribute a cross-session interference benchmark over its released tasks and memory variants.
Same task (PatternLock), same memory system (FrameSamp-Modul). At
k0 the relevant demonstration is the only thing in the
history buffer and the policy reproduces the pattern. At
k7, with seven unrelated sessions inserted between that
demonstration and the query, the same system fails. That gap is what this
benchmark measures.
Abstract
Robots deployed in realistic settings accumulate experience across many sessions, tasks, and users. Current robot-memory evaluations focus on what a policy can recall within a single episode or short context. We introduce RoboMME-Interference, a cross-session benchmark over RoboMME task families that measures whether a policy can still use a relevant prior session once unrelated robot experiences are inserted into the history buffer between that session and the query.
We evaluate nine runnable memory-policy variants across nine task families and 18,450 rollouts. Perceptual memory systems improve success sharply when the relevant session is nearby, but the benefit decays as distractor sessions accumulate, and current systems largely fail to sustain it across sessions.
18,450 completed rollouts
9 task families
9 evaluated systems
5 history conditions
Contributions
- A cross-session interference benchmark. We extend RoboMME with a controllable history buffer: a relevant prior demonstration followed by k ∈ {0, 1, 3, 7} unrelated sessions, turning memory distance into a measured variable.
- A complete, released result grid. Nine RoboMME task families × 50 episodes for every system and history condition: eight memory variants under all five history conditions, plus the π0.5 baseline at no-history only, giving 41 system-condition cells and 18,450 rollouts.
- A measurable failure mode. Perceptual memory delivers large near-session gains (FrameSamp-Modul: +27 pp at k0) that erode steadily under interference (−26 pp by k7), while recurrent variants stay near baseline. Systems that look strong at short range can collapse across sessions.
The Benchmark
Every query episode is run through the same policy interface, varying only the external history buffer. The relevant lesson, a prior demonstration of the task, always appears first; the history conditions differ in how many unrelated sessions are inserted before the query reads it.
- no-history: query only.
- k0: the relevant prior session only.
- k1/k3/k7: the relevant prior session followed by 1, 3, or 7 unrelated distractor sessions. Larger k pushes the relevant memory farther back.
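The five history conditions can be sketched as a small buffer-building function. This is a minimal illustration, not the released implementation: the function name `build_history`, the `CONDITIONS` mapping, and the use of plain strings as session identifiers are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Condition names as used on this page; the value is the distractor count k.
# None marks the no-history condition (hypothetical encoding).
CONDITIONS: Dict[str, Optional[int]] = {
    "no-history": None,
    "k0": 0,
    "k1": 1,
    "k3": 3,
    "k7": 7,
}


def build_history(condition: str, relevant: str, distractors: List[str]) -> List[str]:
    """Return the history buffer, oldest session first, for one condition.

    `relevant` and `distractors` are placeholder session identifiers;
    the released code may represent sessions differently.
    """
    k = CONDITIONS[condition]
    if k is None:
        return []  # no-history: the policy sees the query only
    if len(distractors) < k:
        raise ValueError(f"{condition} needs {k} distractor sessions")
    # Relevant session first, then k unrelated sessions; the query follows the buffer.
    return [relevant] + distractors[:k]
```

For example, `build_history("k3", "demo", ["a", "b", "c", "d"])` yields `["demo", "a", "b", "c"]`, so the relevant session sits three sessions behind the query.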
Why these nine tasks. The protocol requires the task-relevant information to live in a self-contained prior session that can be lifted into the history buffer. We therefore use the nine RoboMME tasks that deliver this information as a prior demonstration video, distinct from execution; the remaining tasks embed their cue inside a single episode and have no separable prior session to place under interference.
Why unrelated distractors. Distractor sessions are drawn from different task families, so they cannot inject contradictory same-family facts. The benchmark therefore measures interference from genuinely unrelated robot experience, the deployment-like failure mode, rather than constructed wrong priors.
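The distractor constraint can be illustrated with a short sampling sketch. It assumes distractors are drawn from the nine families listed on this page and sampled with replacement; neither detail is stated here, and `sample_distractor_families` is a hypothetical helper, not part of the released code.

```python
import random
from typing import List

TASK_FAMILIES = [
    "VideoRepick", "VideoPlaceOrder", "VideoPlaceButton",
    "VideoUnmask", "VideoUnmaskSwap", "MoveCube",
    "InsertPeg", "PatternLock", "RouteStick",
]


def sample_distractor_families(query_family: str, k: int, seed: int = 0) -> List[str]:
    """Pick k distractor task families, none equal to the query's family."""
    if query_family not in TASK_FAMILIES:
        raise ValueError(f"unknown task family: {query_family}")
    rng = random.Random(seed)  # seeded so a given episode is reproducible
    candidates = [f for f in TASK_FAMILIES if f != query_family]
    # Assumption: sampling with replacement; the benchmark may sample differently.
    return [rng.choice(candidates) for _ in range(k)]
```

Excluding the query's own family is what keeps a distractor from showing a conflicting demonstration of the same task.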
Tasks
Nine task families from RoboMME, each presenting a prior demonstration the policy must use during a separate execution episode. They span three memory types.
| Task | Memory type | What the prior session must supply |
|---|---|---|
| VideoRepick | Object | Which object was picked up in the demonstration, to re-pick it. |
| VideoPlaceOrder | Object | Which target in an ordered set the cube belongs on. |
| VideoPlaceButton | Object | The cube-to-target placement shown in the demonstration. |
| VideoUnmask | Spatial | Where an occluded cube of a given color is located. |
| VideoUnmaskSwap | Spatial | Occluded-cube location after the cubes have been swapped. |
| MoveCube | Procedural | The demonstrated manner of moving the cube to its target. |
| InsertPeg | Procedural | Which peg to grasp and how to insert it, from the demonstration. |
| PatternLock | Procedural | A continuous pattern traced over a grid, to be retraced. |
| RouteStick | Procedural | A path through obstacles, including turn directions, to repeat. |
Task definitions and assets are from RoboMME (Dai et al., 2026).
Memory Systems
We run RoboMME's released memory-augmented variants through the benchmark. Each pairs a memory representation (frames, tokens, or hidden states) with a mechanism for integrating that memory into the base π0.5 policy: Context concatenates memory tokens with the input, Modulator conditions the policy via adaptive LayerNorm, and Expert routes memory through a dedicated expert with block-wise causal attention. The perceptual variants (FrameSamp, TokenDrop) sample or drop visual-history tokens; the recurrent ones (TTT, RMT) keep a compressed hidden state. The baseline is the π0.5 backbone with no memory.
| Representation | Context | Modulator | Expert |
|---|---|---|---|
| FrameSamp (perceptual) | ✓ | ✓ | ✓ |
| TokenDrop (perceptual) | ✓ | ✓ | ✓ |
| TTT (recurrent) | ✓ | – | ✓ |
| RMT (recurrent) | – | – | – |
| SimpleSG / GroundSG (symbolic) | skipped: uses predefined subtask annotations | | |
✓ evaluated: eight memory variants plus the π0.5 baseline (nine systems).
– not evaluated: RMT and TTT-Modulator had no released checkpoints at evaluation time.
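To make the Modulator mechanism concrete, here is a generic adaptive-LayerNorm sketch in plain Python. It is not RoboMME's implementation: in a real model the `scale` and `shift` vectors would come from a learned projection of the memory representation, and here they are simply passed in as hypothetical inputs.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Standard LayerNorm over one feature vector, without learned parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaptive_layer_norm(x: List[float], scale: List[float], shift: List[float]) -> List[float]:
    """Normalize x, then apply a memory-dependent scale and shift.

    `scale` and `shift` stand in for the outputs of a small network applied
    to the memory features (an assumption made for illustration).
    """
    if not (len(x) == len(scale) == len(shift)):
        raise ValueError("x, scale, and shift must have the same length")
    normed = layer_norm(x)
    # With scale = 0 and shift = 0 this reduces to plain LayerNorm.
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]
```

The contrast with Context is that memory here changes how existing activations are normalized instead of adding tokens to the input sequence.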
Results
1 Relevant memory helps — when it is near
At k0, with only the relevant session in history,
perceptual memory lifts success well above each system's own
no-history floor: FrameSamp-Modul from 18.2% to 45.3%
(+27.1 pp) and TokenDrop-Modul from 17.1% to 35.3%
(+18.2 pp), against the 17.3%
π0.5 baseline.
The k0 gain is largest for FrameSamp-Modul and
TokenDrop-Modul, and falls back toward the no-history floor as
distractors are added.
2 Interference erodes the benefit
As unrelated sessions push the relevant memory farther back, the gain
collapses: FrameSamp-Modul drops by about 26.0 pp
from k0 to k7 and TokenDrop-Modul by about
15.6 pp. By k7 most systems sit
close to where they started with no history at all.
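The gain and erosion figures can be rechecked from the rounded success rates in the full grid. The sketch below uses only those published percentages; the helper names are illustrative, and because the inputs are rounded to one decimal, a recomputed difference can land 0.1 pp away from a figure quoted in the text.

```python
from typing import Dict

# Success rates (%) copied from the full grid on this page.
GRID: Dict[str, Dict[str, float]] = {
    "FrameSamp-Modul": {"no-history": 18.2, "k0": 45.3, "k7": 19.3},
    "TokenDrop-Modul": {"no-history": 17.1, "k0": 35.3, "k7": 19.8},
}


def near_gain(row: Dict[str, float]) -> float:
    """Percentage-point lift from no-history to k0."""
    return round(row["k0"] - row["no-history"], 1)


def erosion(row: Dict[str, float]) -> float:
    """Percentage-point drop from k0 to k7."""
    return round(row["k0"] - row["k7"], 1)


def residual(row: Dict[str, float]) -> float:
    """Percentage-point margin left over the no-history floor at k7."""
    return round(row["k7"] - row["no-history"], 1)


for name, row in GRID.items():
    print(f"{name}: gain {near_gain(row)} pp, erosion {erosion(row)} pp, "
          f"residual {residual(row)} pp")
```

On the rounded table, FrameSamp-Modul gains 27.1 pp and loses 26.0 pp, leaving a 1.1 pp margin over its own floor at k7; TokenDrop-Modul gains 18.2 pp and loses 15.5 pp, leaving 2.7 pp.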
3 What survives, and where
The decay is not uniform across systems or tasks. FrameSamp-Modul holds its margin over baseline longest and frame sampling proves more robust than token dropping, while recurrent TTT variants stay near the baseline throughout. Across tasks, the lift concentrates on easy and medium episodes; hard tasks stay near the floor for every system, since a policy can only act on a recalled session if it can do the task at all.
Per-family success rate
Where each system's memory benefit appears and how it survives interference, broken out by task family.
Full grid
| System | No history | k0 | k1 | k3 | k7 |
|---|---|---|---|---|---|
| π0.5 (baseline) | 17.3% | - | - | - | - |
| FrameSamp-Modul | 18.2% | 45.3% | 38.4% | 30.0% | 19.3% |
| TokenDrop-Modul | 17.1% | 35.3% | 30.9% | 23.6% | 19.8% |
| FrameSamp-Context | 17.8% | 26.7% | 18.7% | 18.7% | 17.8% |
| FrameSamp-Expert | 17.6% | 27.1% | 20.9% | 19.6% | 19.6% |
| TokenDrop-Context | 13.8% | 22.9% | 19.1% | 15.3% | 13.1% |
| TokenDrop-Expert | 16.9% | 27.3% | 20.2% | 15.6% | 14.0% |
| Recurrent-TTT-Expert | 16.0% | 18.0% | 16.4% | 16.2% | 15.6% |
| Recurrent-TTT-Context | 16.9% | 15.3% | 16.2% | 17.6% | 14.0% |
Success rates across all nine families.
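The rollout total can be reconciled with the populated cells of the grid. This is a bookkeeping sketch that assumes every populated cell pools 9 task families × 50 episodes, as the contributions list implies; the dictionary below only records which cells carry a number in the table.

```python
FAMILIES = 9
EPISODES_PER_FAMILY = 50  # per system and history condition

MEMORY_VARIANTS = [
    "FrameSamp-Modul", "TokenDrop-Modul", "FrameSamp-Context",
    "FrameSamp-Expert", "TokenDrop-Context", "TokenDrop-Expert",
    "Recurrent-TTT-Expert", "Recurrent-TTT-Context",
]

# True where the grid reports a number; columns are no-history, k0, k1, k3, k7.
POPULATED = {name: [True] * 5 for name in MEMORY_VARIANTS}
POPULATED["baseline"] = [True, False, False, False, False]  # no-history only

cells = sum(sum(row) for row in POPULATED.values())
rollouts_per_cell = FAMILIES * EPISODES_PER_FAMILY
total_rollouts = cells * rollouts_per_cell

print(cells, rollouts_per_cell, total_rollouts)
```

Under that assumption there are 41 populated cells of 450 rollouts each, which matches the 18,450 rollouts reported.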
Qualitative Examples
Watch one query episode under increasing interference. Within each
task, a system's clips run left to right as the relevant prior session
is pushed farther back: no-history, then k0
(relevant prior only), then k1, k3, and
k7 with more unrelated sessions inserted after it. A green
outline marks a success in the canonical rollout table, and the rail
beneath each strip traces pass to fail.
Three representative tasks and a curated set of systems, chosen to show distinct memory behaviors. The full nine-task × nine-system grid is in the results table above and in the paper; these clips are illustrative and never contradict the canonical labels.
Reproducibility & Artifacts
Everything needed to reproduce the grid and build on the protocol is released.
- Paper — the full method and protocol on arXiv.
- GitHub repository — code, analysis scripts, and figures.
- Canonical rollout CSV — all 18,450 rollouts, the source of truth.
- Main success table — per-system rates with confidence intervals.
- Results summary — headline numbers and effects.
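For readers who want to attach an interval to a rate from the grid, a Wilson score interval is one common choice. This page does not say which interval the main success table uses, so the method here is an assumption, as is the per-cell sample size of 450 (9 families × 50 episodes) and the success count of 204, which is back-computed from the rounded 45.3% for FrameSamp-Modul at k0.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical count: 204 of 450 is the integer closest to 45.3% of 450.
low, high = wilson_interval(204, 450)
print(f"{204 / 450:.1%} success, interval {low:.1%} to {high:.1%}")
```

With 450 rollouts per cell the interval is several percentage points wide on each side, which is worth keeping in mind when comparing systems that differ by only a point or two.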
Citation
@misc{rathi2026robommeinterference,
title = {Benchmarking Robot Memory Under Interference},
author = {Rathi, Soumil},
year = {2026},
eprint = {2606.22338},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2606.22338}
}
This benchmark builds on RoboMME; please cite it as well.
@article{dai2026robomme,
title = {RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies},
author = {Dai, Yinpei and Fu, Hongze and Lee, Jayjun and Liu, Yuejiang and
Zhang, Haoran and Yang, Jianing and Finn, Chelsea and Fazeli, Nima and Chai, Joyce},
journal = {arXiv preprint arXiv:2603.04639},
year = {2026}
}