Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning
Vision-language-action models have gained significant attention for their ability to model trajectories in robot learning. However, most existing models rely on Transformers with vanilla causal attention, which we find suboptimal for processing segmented multi-modal sequences. Moreover, autoregressive generation falls short when producing multi-dimensional actions. In this paper, …
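To make the abstract's contrast concrete, the following is a minimal sketch (not the paper's actual mechanism) of the difference between a vanilla causal attention mask and a segment-aware alternative for a multi-modal trajectory. The segment layout and the function names `causal_mask` and `segment_causal_mask` are illustrative assumptions: under a vanilla causal mask, tokens inside one segment (e.g., an image observation) cannot attend to later tokens of the same segment, whereas a segment-level mask permits full attention within a segment and causal attention across segments.

```python
import numpy as np

def causal_mask(n):
    # Vanilla causal mask: token i attends only to tokens 0..i,
    # even when i and i+1 belong to the same observation segment.
    return np.tril(np.ones((n, n), dtype=bool))

def segment_causal_mask(segment_ids):
    # Segment-level causality (illustrative): query i attends to key j
    # iff j's segment does not come after i's, so attention is
    # bidirectional within a segment and causal across segments.
    seg = np.asarray(segment_ids)
    return seg[None, :] <= seg[:, None]

# Hypothetical trajectory step: 2 image tokens, 2 language tokens,
# 3 action-dimension tokens, tagged with segment ids 0, 1, 2.
ids = [0, 0, 1, 1, 2, 2, 2]
print(causal_mask(len(ids)).astype(int))
print(segment_causal_mask(ids).astype(int))
```

In the printed masks, the off-diagonal ones inside each segment's block appear only in the segment-level variant; the vanilla causal mask blocks them, which is one way to read the abstract's claim that plain causal attention handles segmented multi-modal sequences suboptimally.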