Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

Type: Preprint

Publication Date: 2024-06-27

Citations: 1

DOI: https://doi.org/10.48550/arxiv.2406.19263

Abstract

Graphical User Interfaces (GUIs) are central to our interaction with digital devices, and growing efforts have recently been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid accessibility screen-reading tools and is in great need of new models driven by advances in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent that uses a novel ToL grounding mechanism to address the SPR task. Given an input point coordinate and the corresponding GUI screenshot, the ToL agent constructs a Hierarchical Layout Tree. Using this tree, the agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements. Such layout information is crucial for accurately interpreting information on the screen and distinguishes the ToL agent from other screen-reading tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed SPR benchmark, which includes GUIs from mobile, web, and operating systems. Finally, we test the ToL agent on mobile GUI navigation tasks, demonstrating its utility in identifying incorrect actions along agent execution trajectories. Code and data: screen-point-and-read.github.io
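
The abstract describes the core ToL grounding step: locating the user's point inside a Hierarchical Layout Tree and reading off a chain of progressively tighter regions ("lenses") around it. The following minimal Python sketch illustrates that idea; it is not the authors' implementation, and the names (LayoutNode, lens_path) and the toy tree are hypothetical.

    # Sketch of Tree-of-Lens style grounding: given a point and a hierarchy of
    # screen regions, collect the root-to-leaf chain of regions ("lenses")
    # that contain the point. All names here are illustrative assumptions.
    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    Point = Tuple[int, int]
    BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

    @dataclass
    class LayoutNode:
        """One region in a Hierarchical Layout Tree."""
        label: str                    # e.g. "screen", "block", "button"
        bbox: BBox
        children: List["LayoutNode"] = field(default_factory=list)

    def contains(bbox: BBox, point: Point) -> bool:
        x, y = point
        left, top, right, bottom = bbox
        return left <= x <= right and top <= y <= bottom

    def lens_path(root: LayoutNode, point: Point) -> List[LayoutNode]:
        """Walk from the root to the deepest region containing the point.
        Crops of these progressively tighter lenses would then be given to
        an MLLM to describe the target content and its surrounding layout."""
        path: List[LayoutNode] = []
        node: Optional[LayoutNode] = root
        while node is not None and contains(node.bbox, point):
            path.append(node)
            node = next((c for c in node.children if contains(c.bbox, point)), None)
        return path

    # Toy usage: a screen containing one block that holds a button.
    screen = LayoutNode("screen", (0, 0, 1080, 1920), [
        LayoutNode("block", (0, 200, 1080, 800), [
            LayoutNode("button", (40, 240, 400, 320)),
        ]),
    ])
    print([n.label for n in lens_path(screen, (100, 300))])
    # -> ['screen', 'block', 'button']

In the actual system the tree would be predicted from the screenshot by detection and segmentation models rather than hard-coded; see the project page above for the authors' code and data.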

Locations

  • arXiv (Cornell University): https://arxiv.org/abs/2406.19263

Similar Works

  • Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (2024) - Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
  • Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (2024) - Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su
  • GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation (2023) - An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao
  • Ponder & Press: Advancing Visual GUI Agent towards General Computer Control (2024) - Yiqin Wang, Haoji Zhang, Jing-qi Tian, Yansong Tang
  • Referring to Screen Texts with Voice Assistants (2023) - Shruti Bhargava Choubey, Anand Dhoot, Ing‐Marie Jonsson, Long Hoang Nguyen, Alkesh Patel, Hong Yu, Vincent Renkens
  • Understanding Mobile GUI: From Pixel-Words to Screen-Sentences (2024) - Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Nanning Zheng
  • GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents (2024) - Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, L Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li
  • SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (2024) - Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu
  • CogAgent: A Visual Language Model for GUI Agents (2023) - Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding
  • ScreenAI: A Vision-Language Model for UI and Infographics Understanding (2024) - Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
  • E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion (2024) - Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu
  • Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning (2024) - Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu
  • Understanding Mobile GUI: from Pixel-Words to Screen-Sentences (2021) - Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Sam Yang, Grayson Hilliard
  • Large Language Model-Brained GUI Agents: A Survey (2024) - Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, G. M. Liu, Qingwei Lin
  • Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining (2024) - Zhiqi Ge, Juncheng Li, Xiaoli Pang, Minghe Gao, Kaihang Pan, Lin Wang, Fei Hao, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
  • Inferring Alt-text For UI Icons With Large Language Models During App Development (2024) - Sabrina Haque, Christoph Csallner
  • Improved GUI Grounding via Iterative Narrowing (2024) - Anthony Nguyen
  • GUIDE: Graphical User Interface Data for Execution (2024) - Rajat Chawla, A. N. JHA, Muskaan Kumar, Mukunda NS, Ishaan Bhola
  • You Only Look at Screens: Multimodal Chain-of-Action Agents (2023) - Zhuosheng Zhang, Aston Zhang
