Ponder & Press: Advancing Visual GUI Agent towards General Computer
Control
Most existing GUI agents depend on non-vision inputs such as HTML source code or accessibility trees, which limits their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real-world objects, offer a potential alternative. However, they often struggle with accurately …