Robotics 5
♻ ☆ On the complexity of constrained reconfiguration and motion planning
Coordinating the motion of multiple agents in constrained environments is a
fundamental challenge in robotics, motion planning, and scheduling. A
motivating example involves $n$ robotic arms, each represented as a line
segment. The objective is to rotate each arm to its vertical orientation, one
at a time (clockwise or counterclockwise), without collisions and without rotating any
arm more than once. This scenario is an example of the more general
$k$-Compatible Ordering problem, where $n$ agents, each capable of $k$
state-changing actions, must transition to specific target states under
constraints encoded as a set $\mathcal{G}$ of $k$ pairs of directed graphs.
We show that $k$-Compatible Ordering is $\mathsf{NP}$-complete, even when
$\mathcal{G}$ is planar, degenerate, or acyclic. On the positive side, we
provide polynomial-time algorithms for cases such as when $k = 1$ or
$\mathcal{G}$ has bounded treewidth. We also introduce generalized variants
supporting multiple state-changing actions per agent, broadening the
applicability of our framework. These results extend to a wide range of
scheduling, reconfiguration, and motion planning applications in constrained
environments.
comment: Looking to incorporate comments from reviewers
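To make the setting concrete, below is a purely illustrative brute-force solver for tiny instances of the arm-rotation example, assuming a hypothetical pairwise predicate `collides` in place of the paper's graph-pair encoding $\mathcal{G}$; its exponential running time is consistent with the NP-completeness result, not with the paper's polynomial special cases.

```python
from itertools import permutations, product

def brute_force_schedule(n, collides, k=2):
    """Exhaustive search for a compatible ordering on a tiny instance.

    `collides(i, a_i, j, a_j)` is a hypothetical predicate: True when
    moving agent i (taking action a_i) before agent j (taking action
    a_j) causes a conflict. The paper encodes such constraints as a
    set of k pairs of directed graphs instead.
    """
    for actions in product(range(k), repeat=n):   # e.g. 0 = cw, 1 = ccw
        for order in permutations(range(n)):
            if all(not collides(order[a], actions[order[a]],
                                order[b], actions[order[b]])
                   for a in range(n) for b in range(a + 1, n)):
                return order, actions             # a valid schedule
    return None                                   # no compatible ordering
```

For the robotic-arm instance, `collides` would test whether the circular sweep of the earlier-moving arm intersects the later arm, which is where the geometry enters.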
♻ ☆ Insights from Interviews with Teachers and Students on the Use of a Social Robot in Computer Science Class in Sixth Grade
In this paper we report initial insights from interviews with teachers and
students on the use of social robots in a sixth-grade computer science class.
Our focus is on learning about requirements and potential applications. We are
particularly interested in both perspectives, the teachers' and the learners',
on how robots could be used and what features they should or should not have.
Results show that teachers and students alike are very open to robots in the
classroom. However, the requirements are in part quite heterogeneous between
the two groups, which leads to complex design challenges that we discuss at
the end of this paper.
comment: 4 pages, 2 figures, Late Breaking Report accepted for RO-MAN 2025
♻ ☆ LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation
Predictive manipulation has recently gained considerable attention in the
Embodied AI community due to its potential to improve robot policy performance
by leveraging predicted states. However, generating accurate future visual
states of robot-object interactions from world models remains a well-known
challenge, particularly in achieving high-quality pixel-level representations.
To this end, we propose LaDi-WM, a world model that predicts the latent space
of future states using diffusion modeling. Specifically, LaDi-WM leverages the
well-established latent space aligned with pre-trained Visual Foundation Models
(VFMs), which comprises both geometric features (DINO-based) and semantic
features (CLIP-based). We find that predicting the evolution of the latent
space is easier to learn and more generalizable than directly predicting
pixel-level images. Building on LaDi-WM, we design a diffusion policy that
iteratively refines output actions by incorporating forecasted states, thereby
generating more consistent and accurate results. Extensive experiments on both
synthetic and real-world benchmarks demonstrate that LaDi-WM significantly
enhances policy performance by 27.9\% on the LIBERO-LONG benchmark and by 20\%
in real-world scenarios. Furthermore, our world model and policies achieve
impressive generalizability in real-world experiments.
comment: CoRL 2025
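As a rough sketch of the iterative refinement loop described above, assuming placeholder linear modules in place of the paper's diffusion-based world model and diffusion policy (the real latent is a concatenation of DINO-based geometric and CLIP-based semantic features):

```python
import torch

LATENT_DIM, ACTION_DIM = 512, 7   # assumed sizes, not from the paper

# Stand-ins for LaDi-WM's components; the actual world model is a
# latent diffusion model over VFM features, not a single linear layer.
world_model = torch.nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)
policy = torch.nn.Linear(2 * LATENT_DIM, ACTION_DIM)

def refine_action(latent, n_iters=3):
    """Iteratively refine an action using forecasted latent states."""
    forecast = latent                 # no forecast yet on the first pass
    action = None
    for _ in range(n_iters):
        action = policy(torch.cat([latent, forecast], dim=-1))
        forecast = world_model(torch.cat([latent, action], dim=-1))
    return action

obs_latent = torch.randn(1, LATENT_DIM)   # stand-in for VFM features
assert refine_action(obs_latent).shape == (1, ACTION_DIM)
```

The design point is that the policy conditions on the world model's forecast of where the scene is heading, rather than on the current observation alone.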
♻ ☆ Scaling Up without Fading Out: Goal-Aware Sparse GNN for RL-based Generalized Planning
Generalized planning using deep reinforcement learning (RL) combined with
graph neural networks (GNNs) has shown promising results in various symbolic
planning domains described by PDDL. However, existing approaches typically
represent planning states as fully connected graphs, leading to a combinatorial
explosion in edge information and substantial sparsity as problem scale grows,
an effect especially evident in large grid-based environments. This dense representation
results in diluted node-level information, exponentially increases memory
requirements, and ultimately makes learning infeasible for larger-scale
problems. To address these challenges, we propose a sparse, goal-aware GNN
representation that selectively encodes relevant local relationships and
explicitly integrates spatial features related to the goal. We validate our
approach by designing novel drone mission scenarios based on PDDL within a grid
world, effectively simulating realistic mission execution environments. Our
experimental results demonstrate that our method scales effectively to larger
grid sizes previously infeasible with dense graph representations and
substantially improves policy generalization and success rates. Our findings
provide a practical foundation for addressing realistic, large-scale
generalized planning tasks.
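One plausible reading of the sparse, goal-aware representation, sketched below with hypothetical details: a 4-connected grid graph replaces the fully connected one (edge count grows linearly rather than quadratically with the number of cells), and each node carries its offset to the goal cell as an explicit spatial feature.

```python
import torch

def sparse_goal_aware_grid(h, w, goal):
    """Build a 4-neighbour grid graph with goal-relative node features.

    Illustrative only; the paper's exact edge selection and feature
    set are not reproduced here.
    """
    edges = []
    for r in range(h):
        for c in range(w):
            u = r * w + c
            if r + 1 < h:                       # vertical neighbours
                edges += [(u, u + w), (u + w, u)]
            if c + 1 < w:                       # horizontal neighbours
                edges += [(u, u + 1), (u + 1, u)]
    edge_index = torch.tensor(edges, dtype=torch.long).t()   # (2, E)
    gr, gc = goal
    feats = torch.tensor([[r - gr, c - gc]                    # (h*w, 2)
                          for r in range(h) for c in range(w)],
                         dtype=torch.float)
    return edge_index, feats

edge_index, x = sparse_goal_aware_grid(3, 3, goal=(2, 2))
```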
♻ ☆ MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna
Reasoning is central to purposeful action, yet most robotic foundation models
map perception and instructions directly to control, which limits adaptability,
generalization, and semantic grounding. We introduce Action Reasoning Models
(ARMs), a class of robotic foundation models that integrate perception,
planning, and control through a structured three-stage pipeline. Our model,
MolmoAct, encodes observations and instructions into depth-aware perception
tokens, generates mid-level spatial plans as editable trajectory traces, and
predicts precise low-level actions, enabling explainable and steerable
behavior. MolmoAct-7B-D achieves strong performance across simulation and
real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching
tasks, surpassing closed-source Pi-0 and GR00T N1; 86.6% average success on
LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks;
and in real-world fine-tuning, an additional 10% (single-arm) and an additional
22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines
by an additional 23.3% on out-of-distribution generalization and achieves top
human-preference scores for open-ended instruction following and trajectory
steering. Furthermore, we release, for the first time, the MolmoAct Dataset --
a mid-training robot dataset comprising over 10,000 high-quality robot
trajectories across diverse scenarios and tasks. Training with this dataset
yields an average 5.5% improvement in general performance over the base model.
We release all model weights, training code, our collected dataset, and our
action reasoning dataset, establishing MolmoAct as both a state-of-the-art
robotics foundation model and an open blueprint for building ARMs that
transform perception into purposeful action through structured reasoning.
Blogpost: https://allenai.org/blog/molmoact
comment: Appendix included. Code, Data and Weights:
https://allenai.org/blog/molmoact
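A schematic, non-authoritative skeleton of the three-stage ARM pipeline described above; the class and method names are hypothetical placeholders, not the released MolmoAct API:

```python
class ActionReasoningModel:
    """Hypothetical skeleton of an ARM's three-stage pipeline."""

    def perceive(self, image, instruction):
        # Stage 1: encode observation and instruction into depth-aware
        # perception tokens (placeholder: pass inputs through).
        return {"image": image, "instruction": instruction}

    def plan(self, perception):
        # Stage 2: emit a mid-level spatial plan as an editable 2D
        # trajectory trace; editing it is what makes behavior steerable.
        return [(0.00, 0.00), (0.05, 0.10), (0.10, 0.20)]  # dummy waypoints

    def act(self, perception, trace):
        # Stage 3: decode precise low-level actions conditioned on the
        # (possibly user-edited) trace.
        return [{"dx": x, "dy": y, "gripper": 0.0} for x, y in trace]

model = ActionReasoningModel()
perception = model.perceive(image=None, instruction="pick up the mug")
trace = model.plan(perception)
actions = model.act(perception, trace)
```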