ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Type: Preprint

Publication Date: 2024-02-07

Citations: 1

DOI: https://doi.org/10.48550/arxiv.2402.04615

Abstract

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
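The flexible patching strategy the abstract borrows from pix2struct scales each screenshot so its aspect ratio is preserved while the patch grid stays within a fixed budget, instead of resizing every image to one square resolution. A minimal sketch of that idea follows; the function name, patch size, and patch budget are illustrative assumptions, not values taken from the paper.

```python
import math

def flexible_grid(height, width, patch_size=16, max_patches=1024):
    """Sketch of a pix2struct-style flexible patching grid.

    Chooses a rows x cols grid of patches that preserves the input
    aspect ratio while keeping rows * cols <= max_patches.
    (patch_size and max_patches are assumed, illustrative defaults.)
    """
    # Uniform scale factor that would fill the patch budget exactly
    # if rows and cols could be fractional.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    # Round down to whole patches, clamped to at least one per axis.
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols

# A wide desktop screenshot gets more columns than rows,
# so thin UI elements are not squashed by a square resize.
print(flexible_grid(1080, 1920))
```

The design point this illustrates: because the grid adapts to the input shape, the same patch budget serves tall mobile screens, wide desktop pages, and long infographics without distorting their layout.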

Locations

  • arXiv (Cornell University) - View - PDF

Similar Works

  • ScreenAI: A Vision-Language Model for UI and Infographics Understanding (2024): Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
  • Lexi: Self-Supervised Learning of the UI Language (2023): Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva
  • Lexi: Self-Supervised Learning of the UI Language (2022): Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva
  • Harnessing Webpage UIs for Text-Rich Visual Understanding (2024): Junpeng Liu, Tianyue Ou, Yifan Song, Yuzhong Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
  • Improving Language Understanding from Screenshots (2024): Tianyu Gao, Zi-Rui Wang, Adithya Bhaskar, Danqi Chen
  • Understanding Mobile GUI: from Pixel-Words to Screen-Sentences (2021): Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Sam Yang, Grayson Hilliard
  • Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements (2020): Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, Zhiwei Guan
  • CogAgent: A Visual Language Model for GUI Agents (2023): Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding
  • Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (2024): Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
  • VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling (2021): Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, Alexey A. Gritsenko
  • ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations (2023): Yue Jiang, Eldon Schoop, Amanda Swearngin, Jeffrey Nichols
  • Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning (2021): Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li
  • ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces (2021): Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen
  • ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces (2020): Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen, Blaise Agüera y Arcas
  • Aurora: Navigating UI Tarpits via Automated Neural Screen Understanding (2024): Safwat Ali Khan, Wenyu Wang, Yiran Ren, Bin Zhu, Jiangfan Shi, Alyssa McGowan, Wing Lam, Kevin Moran
  • Inferring Alt-text For UI Icons With Large Language Models During App Development (2024): Sabrina Haque, Christoph Csallner
  • UIBert: Learning Generic Multimodal Representations for UI Understanding (2021): Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, Blaise Agüera y Arcas

Works That Cite This (0)


Works Cited by This (0)
