RoboMME-Interference

Benchmarking Robot Memory Under Interference

A cross-session benchmark for memory-augmented VLAs

Soumil Rathi

Independent Researcher

Built on RoboMME (Dai et al., ICML 2026). We contribute a cross-session interference benchmark over its released tasks and memory variants.

k0 (relevant prior only): success
k7 (+7 unrelated sessions): failure

Same task (PatternLock), same memory system (FrameSamp-Modul). At k0 the relevant demonstration is the only thing in the history buffer and the policy reproduces the pattern. At k7, with seven unrelated sessions inserted between that demonstration and the query, the same system fails. The benchmark measures that gap; closing it is the goal.

Abstract

Robots deployed in realistic settings accumulate experience across many sessions, tasks, and users. Current robot-memory evaluations focus on what a policy can recall within a single episode or short context. We introduce RoboMME-Interference, a cross-session benchmark over RoboMME task families that measures whether a policy can still use a relevant prior session once unrelated robot experiences are inserted into the history buffer between that session and the query.

We evaluate nine systems (eight memory-policy variants plus the π0.5 baseline) across nine task families and 18,450 rollouts. Perceptual memory systems improve success sharply when the relevant session is nearby, but the benefit decays as distractor sessions accumulate, and current systems largely fail to sustain it across sessions.

18,450 completed rollouts · 9 task families · 9 evaluated systems · 5 history conditions

Contributions

The Benchmark

Every query episode is run through the same policy interface, varying only the external history buffer. The relevant lesson, a prior demonstration of the task, always appears first; the history conditions differ in how many unrelated sessions are inserted before the query reads it.

RoboMME-Interference protocol diagram
History conditions. no-history: query only. k0: the relevant prior session only. k1/k3/k7: the relevant prior session followed by 1, 3, or 7 unrelated distractor sessions. Larger k pushes the relevant memory farther back.
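The five conditions above can be sketched as a small buffer-construction routine. This is an illustrative sketch only: the `Session` type, the `CONDITIONS` mapping, and `build_history` are hypothetical names, not part of the released code, and the buffer is assumed to be an ordered list with the oldest session first.

```python
from dataclasses import dataclass
from typing import List, Optional, Dict

# Number of distractor sessions per condition (hypothetical mapping).
# None means the query runs with an empty history buffer.
CONDITIONS: Dict[str, Optional[int]] = {
    "no-history": None,
    "k0": 0,
    "k1": 1,
    "k3": 3,
    "k7": 7,
}


@dataclass(frozen=True)
class Session:
    task_family: str
    role: str  # "relevant" or "distractor"


def build_history(condition: str, relevant: Session,
                  distractors: List[Session]) -> List[Session]:
    """Return the history buffer (oldest first) for one query episode."""
    k = CONDITIONS[condition]
    if k is None:
        return []  # no-history: query only
    if len(distractors) < k:
        raise ValueError(f"need {k} distractors, got {len(distractors)}")
    # The relevant session comes first; k distractors follow it, so a
    # larger k places the relevant session farther from the query.
    return [relevant] + distractors[:k]


relevant = Session("PatternLock", "relevant")
pool = [Session(f"other-{i}", "distractor") for i in range(7)]

for name in CONDITIONS:
    print(name, len(build_history(name, relevant, pool)))
```

Only the buffer changes between conditions; the query episode and the policy interface stay fixed, which is what lets differences in success be attributed to the history.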

Why these nine tasks. The protocol requires the task-relevant information to live in a self-contained prior session that can be lifted into the history buffer. We therefore use the nine RoboMME tasks that deliver this information as a prior demonstration video, distinct from execution; the remaining tasks embed their cue inside a single episode and have no separable prior session to place under interference.

Why unrelated distractors. Distractor sessions are drawn from different task families, so they cannot inject contradictory same-family facts. The benchmark therefore measures interference from genuinely unrelated robot experience, the deployment-like failure mode, rather than constructed wrong priors.
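A minimal sketch of the distractor constraint, assuming distractor families are drawn from the other eight benchmark families, without replacement, using a seeded generator. The function name, the seeding scheme, and the without-replacement choice are assumptions made for illustration; the source only states that distractors come from different task families.

```python
import random
from typing import List

# The nine task families named in the benchmark's task table.
TASK_FAMILIES = [
    "VideoRepick", "VideoPlaceOrder", "VideoPlaceButton",
    "VideoUnmask", "VideoUnmaskSwap", "MoveCube",
    "InsertPeg", "PatternLock", "RouteStick",
]


def sample_distractor_families(query_family: str, k: int,
                               seed: int = 0) -> List[str]:
    """Pick k distractor families, none equal to the query family."""
    if query_family not in TASK_FAMILIES:
        raise ValueError(f"unknown family: {query_family}")
    candidates = [f for f in TASK_FAMILIES if f != query_family]
    if k > len(candidates):
        raise ValueError(f"k={k} exceeds {len(candidates)} other families")
    rng = random.Random(seed)  # seeded for repeatable draws (assumption)
    return rng.sample(candidates, k)


distractor_families = sample_distractor_families("PatternLock", k=7)
print(distractor_families)
```

Excluding the query family is the property that matters here: a distractor can crowd the buffer, but it cannot supply a conflicting demonstration of the same task.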

Tasks

Nine task families from RoboMME, each presenting a prior demonstration the policy must use during a separate execution episode. They span three memory types.

Task | Memory type | What the prior session must supply
VideoRepick | Object | Which object was picked up in the demonstration, to re-pick it.
VideoPlaceOrder | Object | Which target in an ordered set the cube belongs on.
VideoPlaceButton | Object | The cube-to-target placement shown in the demonstration.
VideoUnmask | Spatial | Where an occluded cube of a given color is located.
VideoUnmaskSwap | Spatial | Occluded-cube location after the cubes have been swapped.
MoveCube | Procedural | The demonstrated manner of moving the cube to its target.
InsertPeg | Procedural | Which peg to grasp and how to insert it, from the demonstration.
PatternLock | Procedural | A continuous pattern traced over a grid, to be retraced.
RouteStick | Procedural | A path through obstacles, including turn directions, to repeat.

Task definitions and assets are from RoboMME (Dai et al., 2026).

Memory Systems

We run RoboMME's released memory-augmented variants through the benchmark. Each pairs a memory representation (frames, tokens, or hidden states) with a mechanism for integrating that memory into the base π0.5 policy: Context concatenates memory tokens with the input, Modulator conditions the policy via adaptive LayerNorm, and Expert routes memory through a dedicated expert with block-wise causal attention. The perceptual variants (FrameSamp, TokenDrop) sample or drop visual-history tokens; the recurrent ones (TTT, RMT) keep a compressed hidden state. The baseline is the π0.5 backbone with no memory.
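To make the difference between two of these integration routes concrete, here is a toy sketch in plain Python. It is not RoboMME's implementation: the function names are hypothetical, tokens are plain lists of floats, and the Modulator is shown in the common `(1 + scale) * norm(x) + shift` form of adaptive LayerNorm, which is an assumption about the exact parameterization. In a real model the scale and shift would come from a learned projection of the memory; here they are passed in directly.

```python
import math
from typing import List

Token = List[float]


def layer_norm(x: Token, eps: float = 1e-6) -> Token:
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def context_integrate(memory_tokens: List[Token],
                      input_tokens: List[Token]) -> List[Token]:
    """Context route: memory tokens are concatenated with the input."""
    return memory_tokens + input_tokens


def modulator_integrate(x: Token, scale: Token, shift: Token) -> Token:
    """Modulator route: memory-derived scale and shift modulate the
    normalized activations (adaptive LayerNorm, assumed form)."""
    normed = layer_norm(x)
    return [(1.0 + s) * n + b for n, s, b in zip(normed, scale, shift)]


sequence = context_integrate([[0.5, 0.5]], [[1.0, 2.0], [3.0, 4.0]])
modulated = modulator_integrate([1.0, 2.0, 3.0],
                                scale=[0.0, 0.0, 0.0],
                                shift=[0.0, 0.0, 0.0])
print(len(sequence), modulated)
```

The contrast is that Context lengthens the sequence the policy attends over, while the Modulator leaves the sequence length unchanged and injects memory through the normalization parameters.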

Representation | Context | Modulator | Expert
FrameSamp (perceptual) | ✓ | ✓ | ✓
TokenDrop (perceptual) | ✓ | ✓ | ✓
TTT (recurrent) | ✓ | – | ✓
RMT (recurrent) | – | – | –
Symbolic (SimpleSG / GroundSG) | skipped because it uses predefined subtask annotations

✓ evaluated: eight memory variants plus the π0.5 baseline (nine systems).
– not evaluated: RMT and TTT-Modulator had no released checkpoints at evaluation time.
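The system count and the rollout total can be cross-checked with a few lines of arithmetic. The sketch below assumes that every (system, condition, task) cell received the same number of rollouts and that the baseline ran only the no-history condition, as the dashes in its row of the results grid suggest; the per-cell count it derives is an inference, not a figure stated in the source.

```python
from itertools import product

REPRESENTATIONS = ["FrameSamp", "TokenDrop", "TTT", "RMT"]
MECHANISMS = ["Context", "Modulator", "Expert"]

# Combinations without a released checkpoint at evaluation time.
UNAVAILABLE = {
    ("TTT", "Modulator"),
    ("RMT", "Context"), ("RMT", "Modulator"), ("RMT", "Expert"),
}

variants = [(r, m) for r, m in product(REPRESENTATIONS, MECHANISMS)
            if (r, m) not in UNAVAILABLE]
systems = ["pi0.5 baseline"] + [f"{r}-{m}" for r, m in variants]

N_TASKS = 9
N_CONDITIONS = 5          # no-history, k0, k1, k3, k7
TOTAL_ROLLOUTS = 18_450

# Memory variants run all five conditions; the baseline has no memory,
# so it is assumed to run the no-history condition only.
cells = (len(variants) * N_CONDITIONS + 1) * N_TASKS
rollouts_per_cell = TOTAL_ROLLOUTS / cells

print(len(variants), len(systems), cells, rollouts_per_cell)
# -> 8 9 369 50.0
```

Under these assumptions the total divides evenly into 50 rollouts per cell, which is consistent with the reported counts.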

Results

1 Relevant memory helps — when it is near

At k0, with only the relevant session in history, perceptual memory lifts success well above each system's own no-history floor: FrameSamp-Modul from 18.2% to 45.3% (+27.1 pp) and TokenDrop-Modul from 17.1% to 35.3% (+18.2 pp), against the 17.3% π0.5 baseline.

Overall success by history condition
Overall success by memory system and history condition. The k0 gain is largest for FrameSamp-Modul and TokenDrop-Modul, and falls back toward the no-history floor as distractors are added.

2 Interference erodes the benefit

As unrelated sessions push the relevant memory farther back, the gain collapses: from k0 to k7, FrameSamp-Modul drops about 26.0 pp and TokenDrop-Modul about 15.6 pp. By k7 most systems sit close to where they started with no history at all.
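The gains and drops quoted in these two findings are simple differences of grid entries. The sketch below recomputes them from the rounded percentages in the full grid; the dictionary layout and helper names are illustrative. Because it works from rounded values, a result can differ by 0.1 pp from a figure computed on unrounded rates.

```python
# Rounded success rates (%) copied from the full results grid.
GRID = {
    "FrameSamp-Modul": {"no-history": 18.2, "k0": 45.3, "k7": 19.3},
    "TokenDrop-Modul": {"no-history": 17.1, "k0": 35.3, "k7": 19.8},
}


def k0_gain(row: dict) -> float:
    """Percentage-point lift from adding only the relevant session."""
    return round(row["k0"] - row["no-history"], 1)


def k0_to_k7_drop(row: dict) -> float:
    """Percentage points lost between k0 and k7."""
    return round(row["k0"] - row["k7"], 1)


for system, row in GRID.items():
    print(system, k0_gain(row), k0_to_k7_drop(row))
# -> FrameSamp-Modul 27.1 26.0
# -> TokenDrop-Modul 18.2 15.5
```

The TokenDrop-Modul drop comes out as 15.5 pp here against the 15.6 pp quoted in the text, a difference that is presumably due to rounding of the grid entries.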

3 What survives, and where

The decay is not uniform across systems or tasks. FrameSamp-Modul holds its margin over baseline longest and frame sampling proves more robust than token dropping, while recurrent TTT variants stay near the baseline throughout. Across tasks, the lift concentrates on easy and medium episodes; hard tasks stay near the floor for every system, since a policy can only act on a recalled session if it can do the task at all.

Per-family success rate

Where each system's memory benefit appears and how it survives interference, broken out by task family. Each cell is a success rate (%), on a color scale from 0% to 90%.

Full grid

System | No history | k0 | k1 | k3 | k7
π0.5 (baseline) | 17.3% | - | - | - | -
FrameSamp-Modul | 18.2% | 45.3% | 38.4% | 30.0% | 19.3%
TokenDrop-Modul | 17.1% | 35.3% | 30.9% | 23.6% | 19.8%
FrameSamp-Context | 17.8% | 26.7% | 18.7% | 18.7% | 17.8%
FrameSamp-Expert | 17.6% | 27.1% | 20.9% | 19.6% | 19.6%
TokenDrop-Context | 13.8% | 22.9% | 19.1% | 15.3% | 13.1%
TokenDrop-Expert | 16.9% | 27.3% | 20.2% | 15.6% | 14.0%
Recurrent-TTT-Expert | 16.0% | 18.0% | 16.4% | 16.2% | 15.6%
Recurrent-TTT-Context | 16.9% | 15.3% | 16.2% | 17.6% | 14.0%

Success rates across all nine families.

Qualitative Examples

Watch one query episode under increasing interference. Within each task, a system's clips run left to right as the relevant prior session is pushed farther back: no-history, then k0 (relevant prior only), then k1, k3, and k7, with more unrelated sessions inserted between it and the query. A green outline marks a success in the canonical rollout table, and the rail beneath each strip traces pass to fail.

Three representative tasks and a curated set of systems, chosen to show distinct memory behaviors. The full nine-task × nine-system grid is in the results table above and in the paper; these clips are illustrative and never contradict the canonical labels.


Reproducibility & Artifacts

Everything needed to reproduce the grid and build on the protocol is released.

Citation

@misc{rathi2026robommeinterference,
  title         = {Benchmarking Robot Memory Under Interference},
  author        = {Rathi, Soumil},
  year          = {2026},
  eprint        = {2606.22338},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2606.22338}
}

This benchmark builds on RoboMME; please cite it as well.

@article{dai2026robomme,
  title  = {RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies},
  author = {Dai, Yinpei and Fu, Hongze and Lee, Jayjun and Liu, Yuejiang and
            Zhang, Haoran and Yang, Jianing and Finn, Chelsea and Fazeli, Nima and Chai, Joyce},
  journal = {arXiv preprint arXiv:2603.04639},
  year   = {2026}
}