Improving Language Understanding from Screenshots

Type: Preprint

Publication Date: 2024-02-21

Citations: 0

DOI: https://doi.org/10.48550/arXiv.2402.14073

Abstract

An emerging family of language models (LMs) that can process both text and images within a single visual view promises to unlock complex tasks such as chart understanding and UI navigation. We refer to these models as screenshot language models. Despite their appeal, existing screenshot LMs substantially lag behind text-only models on language understanding tasks. To close this gap, we adopt a simplified setting where the model inputs are plain-text-rendered screenshots, and we focus on improving the text ability of screenshot LMs. We propose a novel Patch-and-Text Prediction (PTP) objective, which masks and recovers both image patches of screenshots and text within screenshots. We also conduct extensive ablation studies on masking rates and patch sizes, as well as designs for improving training stability. Our pre-trained model, while taking only visual inputs, achieves performance comparable to BERT on 6 out of 8 GLUE tasks (within 2%) and improves by up to 8% over prior work. Additionally, we extend PTP to train autoregressive screenshot LMs and demonstrate its effectiveness: our models significantly reduce perplexity by utilizing the screenshot context. Together, we hope our findings will inspire future research on developing powerful screenshot LMs and extending their reach to broader applications.
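The PTP objective described above masks and then recovers both image patches and text within a screenshot. The following is a minimal pure-Python sketch of that idea, assuming a squared-error term for masked patches and a cross-entropy term for masked text tokens with equal weighting; the function names, masking-rate handling, and loss weights are illustrative assumptions, not the paper's exact formulation.

```python
import math
import random

def select_mask_indices(n, rate, rng):
    """Choose which of n positions to mask at the given masking rate
    (at least one position is always masked)."""
    k = max(1, round(n * rate))
    return sorted(rng.sample(range(n), k))

def ptp_loss(patches, patch_preds, token_logits, token_ids,
             patch_mask, text_mask):
    """Sketch of a combined Patch-and-Text Prediction loss:
    mean squared error over masked patches plus softmax
    cross-entropy over masked text positions."""
    # Patch term: squared error over masked patch pixels only.
    sq = [(a - b) ** 2
          for i in patch_mask
          for a, b in zip(patches[i], patch_preds[i])]
    patch_term = sum(sq) / len(sq)
    # Text term: cross-entropy of the true token at each masked position.
    ce = 0.0
    for i in text_mask:
        z = token_logits[i]
        log_norm = math.log(sum(math.exp(v) for v in z))
        ce += log_norm - z[token_ids[i]]
    text_term = ce / len(text_mask)
    return patch_term + text_term
```

In an actual training loop, `patch_preds` and `token_logits` would come from the model's decoder heads, and the two masking rates would be tuned separately, which is what the paper's ablations on masking rates and patch sizes explore.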

Locations

  • arXiv (Cornell University)

Similar Works

  • ScreenAI: A Vision-Language Model for UI and Infographics Understanding (2024). Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma.
  • Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (2022). Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter J. Shaw, Ming‐Wei Chang, Kristina Toutanova.
  • Harnessing Webpage UIs for Text-Rich Visual Understanding (2024). Junpeng Liu, Tianyue Ou, Yifan Song, Yuzhong Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue.
  • Understanding Mobile GUI: from Pixel-Words to Screen-Sentences (2021). Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Sam Yang, Grayson Hilliard.
  • Enhancing Vision-Language Pre-training with Rich Supervisions (2024). Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto.
  • Inferring Alt-text For UI Icons With Large Language Models During App Development (2024). Sabrina Haque, Christoph Csallner.
  • BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning (2023). Ching‐Yu Chiang, I-Hua Chang, Shih-Wei Liao.
  • MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (2024). Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Fenglou Huang, D.K. Shah, Xianzhi Du, B. Zhang, Yanghao Li.
  • Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning (2024). Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu.
  • CogAgent: A Visual Language Model for GUI Agents (2023). Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding.
  • Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs (2023). Yonghui Wang, Wengang Zhou, Hao Feng, Keyi Zhou, Houqiang Li.
  • ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding (2025). Xingyu Fu, Min‐Qian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florêncio, Cha Zhang.
  • Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning (2021). Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li.
  • Vision Language Models for Spreadsheet Understanding: Challenges and Opportunities (2024). Shiyu Xia, Junyu Xiong, Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Mengyu Zhou, Yeye He, Shi Han, Dongmei Zhang.
  • EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data (2024). Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang.
  • Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (2024). Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan.
  • Lexi: Self-Supervised Learning of the UI Language (2023). Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva.

Works That Cite This (0)


Works Cited by This (0)
