ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Type: Preprint

Publication Date: 2024-02-07

Citations: 1

DOI: https://doi.org/10.48550/arxiv.2402.04615

Abstract

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
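The flexible patching strategy the abstract borrows from pix2struct scales each screenshot so its aspect ratio is preserved while the patch grid stays within a fixed budget, instead of resizing every image to one square resolution. A minimal sketch of that idea follows; the function name, patch size, and patch budget are illustrative assumptions, not values taken from the paper.

```python
import math

def flexible_grid(height, width, patch_size=16, max_patches=1024):
    """Sketch of a pix2struct-style flexible patching grid.

    Chooses a rows x cols grid of patches that preserves the input
    aspect ratio while keeping rows * cols <= max_patches.
    (patch_size and max_patches are assumed, illustrative defaults.)
    """
    # Uniform scale factor that would fill the patch budget exactly
    # if rows and cols could be fractional.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    # Round down to whole patches, clamped to at least one per axis.
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols

# A wide desktop screenshot gets more columns than rows,
# so thin UI elements are not squashed by a square resize.
print(flexible_grid(1080, 1920))
```

The design point this illustrates: because the grid adapts to the input shape, the same patch budget serves tall mobile screens, wide desktop pages, and long infographics without distorting their layout.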

Locations

  • arXiv (Cornell University) - View - PDF

Similar Works

  • ScreenAI: A Vision-Language Model for UI and Infographics Understanding (2024): Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
  • Lexi: Self-Supervised Learning of the UI Language (2023): Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva
  • Lexi: Self-Supervised Learning of the UI Language (2022): Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva
  • Harnessing Webpage UIs for Text-Rich Visual Understanding (2024): Junpeng Liu, Tianyue Ou, Yifan Song, Yuzhong Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
  • Improving Language Understanding from Screenshots (2024): Tianyu Gao, Zi-Rui Wang, Adithya Bhaskar, Danqi Chen
  • Understanding Mobile GUI: from Pixel-Words to Screen-Sentences (2021): Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Sam Yang, Grayson Hilliard
  • Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements (2020): Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, Zhiwei Guan
  • CogAgent: A Visual Language Model for GUI Agents (2023): Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding
  • Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (2024): Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
  • VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling (2021): Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, Alexey A. Gritsenko
  • ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations (2023): Yue Jiang, Eldon Schoop, Amanda Swearngin, Jeffrey Nichols
  • Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning (2021): Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li
  • ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces (2021): Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen
  • ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces (2020): Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen, Blaise Agüera y Arcas
  • Aurora: Navigating UI Tarpits via Automated Neural Screen Understanding (2024): Safwat Ali Khan, Wenyu Wang, Yiran Ren, Bin Zhu, Jiangfan Shi, Alyssa McGowan, Wing Lam, Kevin Moran
  • Inferring Alt-text For UI Icons With Large Language Models During App Development (2024): Sabrina Haque, Christoph Csallner
  • UIBert: Learning Generic Multimodal Representations for UI Understanding (2021): Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, Blaise Agüera y Arcas

Works That Cite This (0)


Works Cited by This (0)
