Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

Type: Preprint

Publication Date: 2024-06-27

Citations: 1

DOI: https://doi.org/10.48550/arxiv.2406.19263

Abstract

Graphical User Interfaces (GUIs) are central to our interaction with digital devices, and growing efforts have recently been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid accessibility screen-reading tools and is in great need of new models driven by advances in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent that uses a novel ToL grounding mechanism to address the SPR task. Given an input point coordinate and the corresponding GUI screenshot, the ToL agent constructs a Hierarchical Layout Tree. Using this tree, the agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements. Such layout information is crucial for accurately interpreting information on the screen and distinguishes the ToL agent from other screen-reading tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed SPR benchmark, which includes GUIs from mobile, web, and operating systems. Finally, we test the ToL agent on mobile GUI navigation tasks, demonstrating its utility in identifying incorrect actions along agent execution trajectories. Code and data: screen-point-and-read.github.io
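
The abstract describes the core ToL grounding step: locating the user's point inside a Hierarchical Layout Tree and reading off a chain of progressively tighter regions ("lenses") around it. The following minimal Python sketch illustrates that idea; it is not the authors' implementation, and the names (LayoutNode, lens_path) and the toy tree are hypothetical.

    # Sketch of Tree-of-Lens style grounding: given a point and a hierarchy of
    # screen regions, collect the root-to-leaf chain of regions ("lenses")
    # that contain the point. All names here are illustrative assumptions.
    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    Point = Tuple[int, int]
    BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

    @dataclass
    class LayoutNode:
        """One region in a Hierarchical Layout Tree."""
        label: str                    # e.g. "screen", "block", "button"
        bbox: BBox
        children: List["LayoutNode"] = field(default_factory=list)

    def contains(bbox: BBox, point: Point) -> bool:
        x, y = point
        left, top, right, bottom = bbox
        return left <= x <= right and top <= y <= bottom

    def lens_path(root: LayoutNode, point: Point) -> List[LayoutNode]:
        """Walk from the root to the deepest region containing the point.
        Crops of these progressively tighter lenses would then be given to
        an MLLM to describe the target content and its surrounding layout."""
        path: List[LayoutNode] = []
        node: Optional[LayoutNode] = root
        while node is not None and contains(node.bbox, point):
            path.append(node)
            node = next((c for c in node.children if contains(c.bbox, point)), None)
        return path

    # Toy usage: a screen containing one block that holds a button.
    screen = LayoutNode("screen", (0, 0, 1080, 1920), [
        LayoutNode("block", (0, 200, 1080, 800), [
            LayoutNode("button", (40, 240, 400, 320)),
        ]),
    ])
    print([n.label for n in lens_path(screen, (100, 300))])
    # -> ['screen', 'block', 'button']

In the actual system the tree would be predicted from the screenshot by detection and segmentation models rather than hard-coded; see the project page above for the authors' code and data.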

Locations

  • arXiv (Cornell University): https://arxiv.org/abs/2406.19263

Similar Works

  • Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (2024) - Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
  • Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (2024) - Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su
  • GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation (2023) - An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao
  • Ponder & Press: Advancing Visual GUI Agent towards General Computer Control (2024) - Yiqin Wang, Haoji Zhang, Jing-qi Tian, Yansong Tang
  • Referring to Screen Texts with Voice Assistants (2023) - Shruti Bhargava Choubey, Anand Dhoot, Ing‐Marie Jonsson, Long Hoang Nguyen, Alkesh Patel, Hong Yu, Vincent Renkens
  • Understanding Mobile GUI: From Pixel-Words to Screen-Sentences (2024) - Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Nanning Zheng
  • GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents (2024) - Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, L Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li
  • SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (2024) - Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu
  • CogAgent: A Visual Language Model for GUI Agents (2023) - Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding
  • ScreenAI: A Vision-Language Model for UI and Infographics Understanding (2024) - Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
  • E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion (2024) - Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu
  • Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning (2024) - Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu
  • Understanding Mobile GUI: from Pixel-Words to Screen-Sentences (2021) - Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Sam Yang, Grayson Hilliard
  • Large Language Model-Brained GUI Agents: A Survey (2024) - Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, G. M. Liu, Qingwei Lin
  • Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining (2024) - Zhiqi Ge, Juncheng Li, Xiaoli Pang, Minghe Gao, Kaihang Pan, Lin Wang, Fei Hao, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
  • Inferring Alt-text For UI Icons With Large Language Models During App Development (2024) - Sabrina Haque, Christoph Csallner
  • Improved GUI Grounding via Iterative Narrowing (2024) - Anthony Nguyen
  • GUIDE: Graphical User Interface Data for Execution (2024) - Rajat Chawla, A. N. JHA, Muskaan Kumar, Mukunda NS, Ishaan Bhola
  • You Only Look at Screens: Multimodal Chain-of-Action Agents (2023) - Zhuosheng Zhang, Aston Zhang
