Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning
Vision-language-action models have gained significant attention for their ability to model trajectories in robot learning. However, most existing models rely on Transformers with vanilla causal attention, which we find suboptimal for processing segmented multi-modal sequences. Additionally, autoregressive generation falls short when producing multi-dimensional actions. In this paper, …
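To make the attention claim concrete, the sketch below contrasts a vanilla causal mask with a hypothetical segment-level variant in which tokens belonging to the same segment (e.g. the dimensions of one action) attend to each other fully. This is an illustrative assumption, not the paper's actual attention scheme, which the truncated abstract does not specify; the segment layout (`segs`) is likewise invented for the example.

```python
import numpy as np

def causal_mask(n):
    # Standard lower-triangular causal mask: token i attends only to j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def segment_causal_mask(segment_ids):
    # Hypothetical segment-level mask (assumption, not Actra's design):
    # token i attends to token j whenever j's segment is at or before i's,
    # so tokens within one segment attend to each other bidirectionally.
    seg = np.asarray(segment_ids)
    return seg[None, :] <= seg[:, None]

# Toy trajectory of 3 segments: language (0), vision (1), a 3-dim action (2).
segs = [0, 0, 1, 1, 2, 2, 2]
c = causal_mask(len(segs))
s = segment_causal_mask(segs)

# Under vanilla causal attention, the first action dimension (index 4)
# cannot see the later dimensions of the same action (index 6).
print(c[4, 6])  # False
# Under the segment-level variant, all dims of one action see each other.
print(s[4, 6])  # True
```

This illustrates why vanilla causal attention can be a poor fit for segmented sequences: it imposes an arbitrary left-to-right order inside a single action whose dimensions have no inherent ordering.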