Improving Language Understanding from Screenshots

Type: Preprint

Publication Date: 2024-02-21

Citations: 0

DOI: https://doi.org/10.48550/arXiv.2402.14073

Abstract

An emerging family of language models (LMs) that can process both text and images within a single visual view promises to unlock complex tasks such as chart understanding and UI navigation. We refer to these models as screenshot language models. Despite their appeal, existing screenshot LMs substantially lag behind text-only models on language understanding tasks. To close this gap, we adopt a simplified setting where the model inputs are plain-text-rendered screenshots, and we focus on improving the text ability of screenshot LMs. We propose a novel Patch-and-Text Prediction (PTP) objective, which masks and recovers both image patches of screenshots and text within screenshots. We also conduct extensive ablation studies on masking rates and patch sizes, as well as designs for improving training stability. Our pre-trained model, while taking only visual inputs, achieves performance comparable to BERT on 6 out of 8 GLUE tasks (within 2%) and improves by up to 8% over prior work. Additionally, we extend PTP to train autoregressive screenshot LMs and demonstrate its effectiveness: our models significantly reduce perplexity by utilizing the screenshot context. Together, we hope our findings will inspire future research on developing powerful screenshot LMs and extending their reach to broader applications.
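The PTP objective described above masks and then recovers both image patches and text within a screenshot. The following is a minimal pure-Python sketch of that idea, assuming a squared-error term for masked patches and a cross-entropy term for masked text tokens with equal weighting; the function names, masking-rate handling, and loss weights are illustrative assumptions, not the paper's exact formulation.

```python
import math
import random

def select_mask_indices(n, rate, rng):
    """Choose which of n positions to mask at the given masking rate
    (at least one position is always masked)."""
    k = max(1, round(n * rate))
    return sorted(rng.sample(range(n), k))

def ptp_loss(patches, patch_preds, token_logits, token_ids,
             patch_mask, text_mask):
    """Sketch of a combined Patch-and-Text Prediction loss:
    mean squared error over masked patches plus softmax
    cross-entropy over masked text positions."""
    # Patch term: squared error over masked patch pixels only.
    sq = [(a - b) ** 2
          for i in patch_mask
          for a, b in zip(patches[i], patch_preds[i])]
    patch_term = sum(sq) / len(sq)
    # Text term: cross-entropy of the true token at each masked position.
    ce = 0.0
    for i in text_mask:
        z = token_logits[i]
        log_norm = math.log(sum(math.exp(v) for v in z))
        ce += log_norm - z[token_ids[i]]
    text_term = ce / len(text_mask)
    return patch_term + text_term
```

In an actual training loop, `patch_preds` and `token_logits` would come from the model's decoder heads, and the two masking rates would be tuned separately, which is what the paper's ablations on masking rates and patch sizes explore.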

Locations

  • arXiv (Cornell University)

Similar Works

  • ScreenAI: A Vision-Language Model for UI and Infographics Understanding (2024). Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma.
  • Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (2022). Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter J. Shaw, Ming‐Wei Chang, Kristina Toutanova.
  • Harnessing Webpage UIs for Text-Rich Visual Understanding (2024). Junpeng Liu, Tianyue Ou, Yifan Song, Yuzhong Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue.
  • Understanding Mobile GUI: from Pixel-Words to Screen-Sentences (2021). Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Sam Yang, Grayson Hilliard.
  • Enhancing Vision-Language Pre-training with Rich Supervisions (2024). Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto.
  • Inferring Alt-text For UI Icons With Large Language Models During App Development (2024). Sabrina Haque, Christoph Csallner.
  • BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning (2023). Ching‐Yu Chiang, I-Hua Chang, Shih-Wei Liao.
  • MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (2024). Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Fenglou Huang, D.K. Shah, Xianzhi Du, B. Zhang, Yanghao Li.
  • Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning (2024). Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu.
  • CogAgent: A Visual Language Model for GUI Agents (2023). Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding.
  • Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs (2023). Yonghui Wang, Wengang Zhou, Hao Feng, Keyi Zhou, Houqiang Li.
  • ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding (2025). Xingyu Fu, Min‐Qian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florêncio, Cha Zhang.
  • Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning (2021). Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li.
  • Vision Language Models for Spreadsheet Understanding: Challenges and Opportunities (2024). Shiyu Xia, Junyu Xiong, Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Mengyu Zhou, Yeye He, Shi Han, Dongmei Zhang.
  • EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data (2024). Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang.
  • Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (2024). Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan.
  • Lexi: Self-Supervised Learning of the UI Language (2023). Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva.

Works That Cite This (0)


Works Cited by This (0)
