RoboMME-Interference

Benchmarking Robot Memory Under Interference

A cross-session benchmark for memory-augmented VLAs

Soumil Rathi

Independent Researcher

Built on RoboMME (Dai et al., ICML 2026). We contribute a cross-session interference benchmark over its released tasks and memory variants.

k0 (relevant prior only): success

k7 (+7 unrelated sessions): failure

Same task (PatternLock), same memory system (FrameSamp-Modul). At k0 the relevant demonstration is the only thing in the history buffer and the policy reproduces the pattern. At k7, with seven unrelated sessions inserted between that demonstration and the query, the same system fails. That gap is what this benchmark measures.

Abstract

Robots deployed in realistic settings accumulate experience across many sessions, tasks, and users. Current robot-memory evaluations focus on what a policy can recall within a single episode or short context. We introduce RoboMME-Interference, a cross-session benchmark over RoboMME task families that measures whether a policy can still use a relevant prior session once unrelated robot experiences are inserted into the history buffer between that session and the query.

We evaluate nine runnable systems (eight memory-augmented variants plus the π_0.5 baseline) across nine task families and 18,450 rollouts. Perceptual memory systems improve success sharply when the relevant session is nearby, but the benefit decays as distractor sessions accumulate, and current systems largely fail to sustain it across sessions.

18,450 completed rollouts

9 task families

9 evaluated systems

5 history conditions

Contributions

A cross-session interference benchmark. We extend RoboMME with a controllable history buffer: a relevant prior demonstration followed by k ∈ {0,1,3,7} unrelated sessions, turning memory distance into a measured variable.
A complete, released result grid. Nine RoboMME task families × eight memory variants × five history conditions × 50 episodes = 18,000 rollouts, plus the π_0.5 baseline under no-history only (9 × 50 = 450), for 18,450 rollouts in total.
A measurable failure mode. Perceptual memory delivers large near-session gains (FrameSamp-Modul: +27 pp at k0) that erode steadily under interference (−26 pp by k7), while recurrent variants stay near baseline. Systems that look strong at short range can collapse across sessions.
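
The 18,450 total can be checked from the shape of the grid. The sketch below assumes, as the full results table suggests, that the eight memory variants run under all five history conditions while the π_0.5 baseline, which has no memory to fill, runs only under no-history. The variable names are illustrative and not taken from the released code.

```python
# Rollout accounting for the result grid (names are illustrative).
TASK_FAMILIES = 9
EPISODES_PER_CELL = 50
HISTORY_CONDITIONS = ["no-history", "k0", "k1", "k3", "k7"]

MEMORY_VARIANTS = 8       # evaluated memory-augmented systems
BASELINE_CONDITIONS = 1   # assumption: the baseline runs under no-history only

memory_rollouts = (
    MEMORY_VARIANTS * TASK_FAMILIES * len(HISTORY_CONDITIONS) * EPISODES_PER_CELL
)
baseline_rollouts = BASELINE_CONDITIONS * TASK_FAMILIES * EPISODES_PER_CELL
total_rollouts = memory_rollouts + baseline_rollouts

print(memory_rollouts, baseline_rollouts, total_rollouts)  # 18000 450 18450
```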

The Benchmark

Every query episode is run through the same policy interface, varying only the external history buffer. The relevant lesson, a prior demonstration of the task, always appears first; the history conditions differ in how many unrelated sessions are inserted before the query reads it.

Figure: RoboMME-Interference protocol diagram showing the history conditions. `no-history`: query only. `k0`: the relevant prior session only. `k1`/`k3`/`k7`: the relevant prior session followed by 1, 3, or 7 unrelated distractor sessions. Larger `k` pushes the relevant memory farther back.

Why these nine tasks. The protocol requires the task-relevant information to live in a self-contained prior session that can be lifted into the history buffer. We therefore use the nine RoboMME tasks that deliver this information as a prior demonstration video, distinct from execution; the remaining tasks embed their cue inside a single episode and have no separable prior session to place under interference.

Why unrelated distractors. Distractor sessions are drawn from different task families, so they cannot inject contradictory same-family facts. The benchmark therefore measures interference from genuinely unrelated robot experience, the deployment-like failure mode, rather than constructed wrong priors.
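
The history conditions described above can be sketched as a small buffer-building function. This is a minimal illustration, not the released implementation: the `Session` type, the `build_history` name, the use of `None` for the no-history condition, and sampling distractors without replacement are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

FAMILIES = [
    "VideoRepick", "VideoPlaceOrder", "VideoPlaceButton", "VideoUnmask",
    "VideoUnmaskSwap", "MoveCube", "InsertPeg", "PatternLock", "RouteStick",
]

@dataclass(frozen=True)
class Session:
    """Hypothetical record for one recorded session."""
    task_family: str
    session_id: str

def build_history(relevant, pool, k, rng):
    """Return the history buffer for one query.

    k=None models `no-history` (query only). Otherwise the relevant session
    comes first, followed by k distractors from other task families.
    """
    if k is None:
        return []
    candidates = [s for s in pool if s.task_family != relevant.task_family]
    if k > len(candidates):
        raise ValueError("not enough unrelated sessions in the pool")
    # Assumption: distractors are sampled without replacement.
    return [relevant] + rng.sample(candidates, k)

# Toy pool: three placeholder sessions per family.
pool = [Session(f, f"{f}-{i}") for f in FAMILIES for i in range(3)]
relevant = Session("PatternLock", "PatternLock-demo")

for name, k in [("no-history", None), ("k0", 0), ("k1", 1), ("k3", 3), ("k7", 7)]:
    history = build_history(relevant, pool, k, random.Random(0))
    print(name, [s.task_family for s in history])
```

Filtering on `task_family` is what keeps distractors from injecting same-family facts; the relevant session stays at index 0, so a larger `k` only increases its distance from the query.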

Tasks

Nine task families from RoboMME, each presenting a prior demonstration the policy must use during a separate execution episode. They span three memory types.

Task	Memory type	What the prior session must supply
VideoRepick	Object	Which object was picked up in the demonstration, to re-pick it.
VideoPlaceOrder	Object	Which target in an ordered set the cube belongs on.
VideoPlaceButton	Object	The cube-to-target placement shown in the demonstration.
VideoUnmask	Spatial	Where an occluded cube of a given color is located.
VideoUnmaskSwap	Spatial	Occluded-cube location after the cubes have been swapped.
MoveCube	Procedural	The demonstrated manner of moving the cube to its target.
InsertPeg	Procedural	Which peg to grasp and how to insert it, from the demonstration.
PatternLock	Procedural	A continuous pattern traced over a grid, to be retraced.
RouteStick	Procedural	A path through obstacles, including turn directions, to repeat.

Task definitions and assets are from RoboMME (Dai et al., 2026).

Memory Systems

We run RoboMME's released memory-augmented variants through the benchmark. Each pairs a memory representation (frames, tokens, or hidden states) with a mechanism for integrating that memory into the base π_0.5 policy: Context concatenates memory tokens with the input, Modulator conditions the policy via adaptive LayerNorm, and Expert routes memory through a dedicated expert with block-wise causal attention. The perceptual variants (FrameSamp, TokenDrop) sample or drop visual-history tokens; the recurrent ones (TTT, RMT) keep a compressed hidden state. The baseline is the π_0.5 backbone with no memory.
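
To make the Modulator mechanism concrete, here is a minimal sketch of adaptive LayerNorm in its generic form: a conditioning vector (here, a memory embedding) is projected to a per-feature scale and shift that are applied after normalization. It is not RoboMME's implementation; the function names, the linear projections, and the `(1 + scale)` parameterization are assumptions chosen for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weights, vec):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def adaptive_layer_norm(x, memory, w_scale, w_shift):
    """Modulate normalized features with scale and shift predicted from memory."""
    scale = linear(w_scale, memory)   # one scale per feature
    shift = linear(w_shift, memory)   # one shift per feature
    normed = layer_norm(x)
    return [(1.0 + s) * n + b for n, s, b in zip(normed, scale, shift)]

x = [1.0, 2.0, 3.0, 4.0]              # policy features (toy values)
memory = [0.5, -0.5]                  # memory embedding (toy values)
zeros = [[0.0, 0.0] for _ in x]       # zero projections: memory has no effect
w_scale = [[0.2, 0.0], [0.0, 0.2], [0.1, 0.1], [0.0, 0.0]]

print(adaptive_layer_norm(x, memory, zeros, zeros) == layer_norm(x))  # True
print(adaptive_layer_norm(x, memory, w_scale, zeros))
```

With zero projections the layer reduces to plain LayerNorm, so in this formulation the memory influences the policy only through the learned scale and shift.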

Representation	Type	Context	Modulator	Expert
FrameSamp	perceptual	✓	✓	✓
TokenDrop	perceptual	✓	✓	✓
TTT	recurrent	✓	–	✓
RMT	recurrent	–	–	–
SimpleSG / GroundSG	symbolic	skipped because it uses predefined subtask annotations

✓ evaluated: eight memory variants plus the π_0.5 baseline (nine systems).
– not evaluated: RMT and TTT-Modulator had no released checkpoints at evaluation time.

Results

1 Relevant memory helps — when it is near

At k0, with only the relevant session in history, perceptual memory lifts success well above each system's own no-history floor: FrameSamp-Modul from 18.2% to 45.3% (+27.1 pp) and TokenDrop-Modul from 17.1% to 35.3% (+18.2 pp), against the 17.3% π_0.5 baseline.

Figure: overall success by memory system and history condition. The `k0` gain is largest for FrameSamp-Modul and TokenDrop-Modul, and falls back toward the no-history floor as distractors are added.

2 Interference erodes the benefit

As unrelated sessions push the relevant memory farther back, the gain collapses: FrameSamp-Modul drops about 26.0 pp from k0 to k7 and TokenDrop-Modul about 15.6 pp. By k7 most systems sit close to where they started with no history at all.

3 What survives, and where

The decay is not uniform across systems or tasks. FrameSamp-Modul holds its margin over baseline longest and frame sampling proves more robust than token dropping, while recurrent TTT variants stay near the baseline throughout. Across tasks, the lift concentrates on easy and medium episodes; hard tasks stay near the floor for every system, since a policy can only act on a recalled session if it can do the task at all.

Per-family success rate

Interactive heatmap: where each system's memory benefit appears and how it survives interference, broken out by task family. Each cell is a success rate (%), on a color scale from 0% to 90%.

Full grid

System	No history	k0	k1	k3	k7
π_0.5 (baseline)	17.3%	-	-	-	-
FrameSamp-Modul	18.2%	45.3%	38.4%	30.0%	19.3%
TokenDrop-Modul	17.1%	35.3%	30.9%	23.6%	19.8%
FrameSamp-Context	17.8%	26.7%	18.7%	18.7%	17.8%
FrameSamp-Expert	17.6%	27.1%	20.9%	19.6%	19.6%
TokenDrop-Context	13.8%	22.9%	19.1%	15.3%	13.1%
TokenDrop-Expert	16.9%	27.3%	20.2%	15.6%	14.0%
Recurrent-TTT-Expert	16.0%	18.0%	16.4%	16.2%	15.6%
Recurrent-TTT-Context	16.9%	15.3%	16.2%	17.6%	14.0%

Success rates across all nine families.
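
The headline effects can be recomputed from the grid. The sketch below copies the rounded percentages from the table and derives, per system, the k0 gain over its own no-history rate, the change from k0 to k7, and what remains at k7 relative to no-history. The function and variable names are illustrative.

```python
# Success rates (%) copied from the full grid, ordered as CONDITIONS.
CONDITIONS = ["no-history", "k0", "k1", "k3", "k7"]
GRID = {
    "FrameSamp-Modul":       [18.2, 45.3, 38.4, 30.0, 19.3],
    "TokenDrop-Modul":       [17.1, 35.3, 30.9, 23.6, 19.8],
    "FrameSamp-Context":     [17.8, 26.7, 18.7, 18.7, 17.8],
    "FrameSamp-Expert":      [17.6, 27.1, 20.9, 19.6, 19.6],
    "TokenDrop-Context":     [13.8, 22.9, 19.1, 15.3, 13.1],
    "TokenDrop-Expert":      [16.9, 27.3, 20.2, 15.6, 14.0],
    "Recurrent-TTT-Expert":  [16.0, 18.0, 16.4, 16.2, 15.6],
    "Recurrent-TTT-Context": [16.9, 15.3, 16.2, 17.6, 14.0],
}

def memory_effects(rates):
    """Return (k0 gain, k0-to-k7 change, k7 minus no-history) in percentage points."""
    by_condition = dict(zip(CONDITIONS, rates))
    near_gain = by_condition["k0"] - by_condition["no-history"]
    interference_change = by_condition["k7"] - by_condition["k0"]
    residual = by_condition["k7"] - by_condition["no-history"]
    return tuple(round(v, 1) for v in (near_gain, interference_change, residual))

for system, rates in GRID.items():
    gain, change, residual = memory_effects(rates)
    print(f"{system:<22} k0 gain {gain:+5.1f}  k0->k7 {change:+5.1f}  "
          f"k7 vs no-history {residual:+5.1f}")
```

Because the inputs are already rounded to one decimal, a recomputed difference can differ by 0.1 pp from a figure derived from unrounded rates: the TokenDrop-Modul change from k0 to k7 comes out at −15.5 pp here, against the roughly −15.6 pp quoted in the text.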

Qualitative Examples

Watch one query episode under increasing interference. Within each task, a system's clips run left to right as the relevant prior session is pushed farther back: no-history, then k0 (relevant prior only), then k1, k3, and k7, with more unrelated sessions inserted between it and the query. A green outline marks a success in the canonical rollout table, and the rail beneath each strip traces pass to fail.

Three representative tasks and a curated set of systems, chosen to show distinct memory behaviors. The full nine-task × nine-system grid is in the results table above and in the paper; these clips are illustrative and never contradict the canonical labels.


Reproducibility & Artifacts

Everything needed to reproduce the grid and build on the protocol is released.

Paper — the full method and protocol on arXiv.
GitHub repository — code, analysis scripts, and figures.
Canonical rollout CSV — all 18,450 rollouts, the source of truth.
Main success table — per-system rates with confidence intervals.
Results summary — headline numbers and effects.
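
As a rough guide to how a success table with confidence intervals could be rebuilt from a rollout CSV, the sketch below aggregates per-rollout outcomes and attaches a Wilson score interval to each rate. The column names (`system`, `condition`, `success`) and the choice of the Wilson interval are assumptions; the released CSV schema and the interval method used for the main table may differ. The rows in the example are toy placeholders, not benchmark data.

```python
import csv
import io
import math
from collections import defaultdict

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def success_table(csv_file):
    """Map (system, condition) to (rate, ci_low, ci_high)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, rollouts]
    for row in csv.DictReader(csv_file):
        key = (row["system"], row["condition"])  # hypothetical column names
        counts[key][0] += int(row["success"])
        counts[key][1] += 1
    return {key: (s / n, *wilson_interval(s, n)) for key, (s, n) in counts.items()}

# Toy rows for illustration only.
toy_csv = io.StringIO(
    "system,condition,success\n"
    "sysA,k0,1\nsysA,k0,1\nsysA,k0,0\nsysA,k0,1\n"
    "sysA,k7,0\nsysA,k7,1\nsysA,k7,0\nsysA,k7,0\n"
)
for key, (rate, low, high) in success_table(toy_csv).items():
    print(key, f"{rate:.2f} [{low:.2f}, {high:.2f}]")
```

The Wilson interval is used here because it stays inside [0, 1] and behaves sensibly for rates near the floor, which is where most systems sit at k7.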

Citation

@misc{rathi2026robommeinterference,
  title         = {Benchmarking Robot Memory Under Interference},
  author        = {Rathi, Soumil},
  year          = {2026},
  eprint        = {2606.22338},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2606.22338}
}

This benchmark builds on RoboMME; please cite it as well.

@article{dai2026robomme,
  title   = {RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies},
  author  = {Dai, Yinpei and Fu, Hongze and Lee, Jayjun and Liu, Yuejiang and
             Zhang, Haoran and Yang, Jianing and Finn, Chelsea and Fazeli, Nima and Chai, Joyce},
  journal = {arXiv preprint arXiv:2603.04639},
  year    = {2026}
}