You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search

Type: Article

Publication Date: 2023-10-01

Citations: 0

DOI: https://doi.org/10.1109/icsme58846.2023.00014


Abstract

Code search plays a crucial role in software development, enabling developers to retrieve and reuse code using natural language queries. While the performance of code search models improves with more high-quality data, obtaining such data can be challenging and expensive. Recently, large language models (LLMs) such as ChatGPT have made remarkable progress in both natural and programming language understanding and generation, offering user-friendly interaction via simple prompts. Inspired by these advancements, we propose a novel approach, ChatDANCE, which utilizes high-quality and diverse augmented data generated by a large language model and leverages a filtering mechanism to eliminate low-quality augmentations. Specifically, we first propose a set of ChatGPT prompting rules designed for source code and queries. We then leverage ChatGPT to rewrite code and queries according to the corresponding prompts, and propose a filtering mechanism that trains a cross-encoder from the backbone model UniXcoder to filter out code and query pairs with low matching scores. Finally, we re-train the backbone model using the obtained high-quality augmented data. Experimental results show that ChatDANCE achieves state-of-the-art performance, improving the best baseline by 13.2% (R@1) and 7% (MRR). Surprisingly, we find that this augment-filter-retrain strategy enables the backbone model (UniXcoder) to self-grow. Moreover, extensive experiments show the effectiveness of each component and that ChatDANCE has stable performance under different hyperparameter settings. In addition, we conduct qualitative and quantitative analyses to investigate why ChatDANCE works well and find that it learns a more uniform distribution of representations and effectively aligns the code and query spaces. We have made the code and data anonymously available at https://anonymous.4open.science/r/ChatDANCE.
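
The filtering step and the two reported metrics can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract states that a cross-encoder trained from UniXcoder produces the matching scores, whereas `toy_score` below is a hypothetical word-overlap stand-in, and the names `filter_pairs`, `recall_at_1`, `mrr` and the threshold of 0.5 are assumptions chosen only for illustration.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (query, code)


def filter_pairs(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[Pair]:
    """Keep only augmented pairs whose matching score reaches the threshold."""
    return [(q, c) for q, c in pairs if score_fn(q, c) >= threshold]


def toy_score(query: str, code: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the code.

    In the paper this role is played by a trained cross-encoder, not by
    word overlap.
    """
    words = query.lower().split()
    if not words:
        return 0.0
    code_lower = code.lower()
    return sum(w in code_lower for w in words) / len(words)


def recall_at_1(ranks: List[int]) -> float:
    """Share of queries whose correct code snippet is ranked first."""
    return sum(r == 1 for r in ranks) / len(ranks)


def mrr(ranks: List[int]) -> float:
    """Mean reciprocal rank of the correct snippet (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Two augmented pairs: the first matches, the second is a mismatch.
augmented = [
    ("read file lines", "def read_file_lines(path): return open(path).readlines()"),
    ("sort list", "def connect(host): return socket.create_connection(host)"),
]
kept = filter_pairs(augmented, toy_score, threshold=0.5)  # threshold is an assumption

# Ranks of the correct snippet for four hypothetical evaluation queries.
ranks = [1, 2, 4, 1]
print(len(kept), recall_at_1(ranks), mrr(ranks))
```

Under these assumptions only the first pair survives filtering. Passing a different `score_fn` is how a learned scorer would be plugged in without changing the filtering logic.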

Locations

  • arXiv (Cornell University)

Similar Works

  • You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search (2024). Yanlin Wang, Lianghong Guo, Ensheng Shi, Wenqing Chen, Jiachi Chen, Wanjun Zhong, Menghan Wang, Hui Li, Hongyu Zhang, Ziyu Lyu
  • Exploring Representation-Level Augmentation for Code Search (2022). Li HaoChen, Chunyan Miao, Cyril Leung, Yanxian Huang, Yuan Huang, Hongyu Zhang, Yanlin Wang
  • Exploring Representation-level Augmentation for Code Search (2022). Li HaoChen, Chunyan Miao, Cyril Leung, Yanxian Huang, Yuan Huang, Hongyu Zhang, Yanlin Wang
  • CoCoSoDa: Effective Contrastive Learning for Code Search (2022). Ensheng Shi, Wenchao Gu, Yanlin Wang, Lun Du, Hongyu Zhang, Han Shi, Dongmei Zhang, Hongbin Sun
  • CoCoSoDa: Effective Contrastive Learning for Code Search (2023). Ensheng Shi, Yanlin Wang, Wenchao Gu, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun
  • CoSQA+: Enhancing Code Search Dataset with Matching Code (2024). Jing Gong, Yanghui Wu, Linxi Liang, Zibin Zheng, Yanlin Wang
  • ARKS: Active Retrieval in Knowledge Soup for Code Generation (2024). Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu, Qian Liu, Changyuan Yu
  • AugmentedCode: Examining the Effects of Natural Language Resources in Code Retrieval Models (2021). Mehdi Bahrami, N. C. Shrikanth, Yuji Mizobuchi, Lei Liu, Masahiro Fukuyori, Weipeng Chen, Kazuki Munakata
  • ToolCoder: Teach Code Generation Models to use API search tools (2023). Kechi Zhang, Ge Li, Jia Li, Zhuo Li, Zhi Jin
  • Contrastive Prompt Learning-based Code Search based on Interaction Matrix (2023). Yubo Zhang, Yanfang Liu, Xinxin Fan, Yunfeng Lu
  • Instruction Fusion: Advancing Prompt Evolution through Hybridization (2023). Weidong Guo, J. S. Yang, Kaitong Yang, Xiangyang Li, Zhuwei Rao, Yu Xu, Di Niu
  • Selection of Prompt Engineering Techniques for Code Generation through Predicting Code Complexity (2024). Chung-Yu Wang, Alireza DaghighFarsoodeh, Hung Viet Pham
  • CoSQA: 20,000+ Web Queries for Code Search and Question Answering (2021). Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, Nan Duan
  • Evaluating and Optimizing the Effectiveness of Neural Machine Translation in Supporting Code Retrieval Models: A Study on the CAT Benchmark (2023). Hung Phan, Ali Jannesari
  • Enhancing Code Intelligence Tasks with ChatGPT (2023). Kang Yang, Xinjun Mao, Shangwen Wang, Tanghaoran Zhang, Bo Lin, Yanlin Wang, Yihao Qin, Zhang Zhang, Xiaoguang Mao
  • GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding (2023). Andor Diera, Abdelhalim Hafedh Dahou, Lukas Galke, Fabian Karl, Florian Sihler, Ansgar Scherp
  • Repository-Level Prompt Generation for Large Language Models of Code (2022). Disha Shrivastava, Hugo Larochelle, Daniel Tarlow
  • An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities (2025). Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia
  • Crystal: Illuminating LLM Abilities on Language and Code (2024). Tianhua Tao, Junbo Li, Bowen Tan, Hongyi Wang, William Marshall, Bhargav M Kanakiya, Joel Hestness, Natalia Vassilieva, Zhiqiang Shen, Eric P. Xing

Works That Cite This (0)
