Unsupervised Document Embedding via Contrastive Augmentation

Type: Preprint

Publication Date: 2021-03-26

Citations: 4

Abstract

We present a contrastive learning approach with data augmentation techniques to learn document representations in an unsupervised manner. Inspired by recent contrastive self-supervised learning algorithms used for image and NLP pretraining, we hypothesize that high-quality document embeddings should be invariant to diverse paraphrases that preserve the semantics of the original document. With different backbones and contrastive learning frameworks, our study reveals the enormous benefits of contrastive augmentation for document representation learning, with two additional insights: 1) including data augmentation in a contrastive way can substantially improve the embedding quality in unsupervised document representation learning, and 2) in general, stochastic augmentations generated by simple word-level manipulations work much better than sentence-level and document-level ones. We plug our method into a classifier and compare it with a broad range of baseline methods on six benchmark datasets. Our method can decrease the classification error rate by up to 6.4% over SOTA approaches on the document classification task, matching or even surpassing fully supervised methods.
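The abstract does not spell out the training objective, so the following is only a minimal sketch of the general recipe it describes: produce two stochastically augmented views of each document with a simple word-level manipulation (word dropout is assumed here), encode both views, and pull each pair of view embeddings together with an NT-Xent-style contrastive loss while pushing apart the other documents in the batch. The word_dropout helper, the placeholder encoder, and the temperature value are illustrative assumptions, not the paper's exact method.

```python
import random
import torch
import torch.nn.functional as F

def word_dropout(doc: str, p: float = 0.1) -> str:
    """Word-level stochastic augmentation: randomly drop words from the document."""
    words = doc.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept) if kept else doc

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent contrastive loss: the two augmented views of a document are positives;
    every other document in the batch serves as a negative."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, d) stacked view embeddings
    sim = z @ z.t() / temperature                  # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))     # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of each positive
    return F.cross_entropy(sim, targets)

# Usage sketch: augment each document twice, encode both views with any backbone
# (e.g. a bag-of-words MLP or a BERT-style encoder), then minimize the loss.
docs = ["a short example document", "another document about embeddings"]
views_a = [word_dropout(d) for d in docs]
views_b = [word_dropout(d) for d in docs]
# z1, z2 = encoder(views_a), encoder(views_b)     # hypothetical encoder producing (N, d) tensors
z1, z2 = torch.randn(2, 8), torch.randn(2, 8)      # placeholder embeddings for illustration only
loss = nt_xent_loss(z1, z2)
```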

Locations

  • arXiv (Cornell University)

Similar Works

  • Unsupervised Document Embedding via Contrastive Augmentation (2021) by Dongsheng Luo, Wei Cheng, Jingchao Ni, Wenchao Yu, Xuchao Zhang, Bo Zong, Yanchi Liu, Zhengzhang Chen, Dongjin Song, Haifeng Chen
  • AugCSE: Contrastive Sentence Embedding with Diverse Augmentations (2022) by Zilu Tang, Muhammed Yusuf Kocyigit, Derry Wijaya
  • Unsupervised Document Embedding With CNNs (2017) by Chundi Liu, Shunan Zhao, Maksims Volkovs
  • Summarization-based Data Augmentation for Document Classification (2023) by Yueguan Wang, Naoki Yoshinaga
  • UNSEE: Unsupervised Non-contrastive Sentence Embeddings (2024) by Ömer Veysel Çağatan
  • Virtual Augmentation Supported Contrastive Learning of Sentence Representations (2022) by Dejiao Zhang, Xiao Wei, Henghui Zhu, Xiaofei Ma, Andrew Arnold
  • Text Transformations in Contrastive Self-Supervised Learning: A Review (2022) by Amrita Bhattacharjee, Mansooreh Karami, Huan Liu
  • Virtual Augmentation Supported Contrastive Learning of Sentence Representations (2021) by Dejiao Zhang, Wei Xiao, Henghui Zhu, Xiaofei Ma, Andrew O. Arnold
  • Data augmentation approaches in natural language processing: A survey (2022) by Bohan Li, Yutai Hou, Wanxiang Che
  • GASE: Generatively Augmented Sentence Encoding (2024) by Michael L. Frank, Haithem Afli
  • RPN: A Word Vector Level Data Augmentation Algorithm in Deep Learning for Language Understanding (2023) by Zhengqing Yuan, Xiaolong Zhang, Yue Wang, Xuecong Hou, Huiwen Xue, Zhuanzhe Zhao, Yongming Liu
  • RPN: A Word Vector Level Data Augmentation Algorithm in Deep Learning for Language Understanding (2022) by Zhengqing Yuan, Zhuanzhe Zhao, Yongming Liu, Xiaolong Zhang, Xuecong Hou, Yue Wang
  • Shuffle & Divide: Contrastive Learning for Long Text (2022) by Joonseok Lee, Seongho Joe, Kyoungwon Park, Bogun Kim, H. Kang, Jaeseon Park, Youngjune Gwon
  • CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding (2020) by Yanru Qu, Dinghan Shen, Yelong Shen, Sandra Sajeev, Jiawei Han, Weizhu Chen
  • CoT-BERT: Enhancing Unsupervised Sentence Representation through Chain-of-Thought (2023) by Bowen Zhang, Kehua Chang, Chunping Li
  • Enhancing Unsupervised Sentence Embeddings via Knowledge-Driven Data Augmentation and Gaussian-Decayed Contrastive Learning (2024) by Peichao Lai, Zhengfeng Zhang, Wentao Zhang, Fangcheng Fu, Bin Cui
  • DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations (2020) by John Giorgi, Osvald Nitski, Gary D. Bader, Bo Wang

Works Cited by This (0)
