Type: Article
Publication Date: 2022-06-01
Citations: 200
DOI: https://doi.org/10.1109/cvpr52688.2022.00836
Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP) have shown inspiring performance on 2D visual recognition, which learns to match images with their corresponding texts in open-vocabulary settings. However, it remains underexplored whether CLIP, pre-trained on large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts. Specifically, we encode a point cloud by projecting it onto multi-view depth maps and aggregate the view-wise zero-shot predictions in an end-to-end manner, which achieves efficient knowledge transfer from 2D to 3D. We further design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot 3D knowledge into CLIP pre-trained in 2D. By fine-tuning just the adapter under few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe that the knowledge of PointCLIP is complementary to that of classical 3D-supervised networks. Via a simple ensemble during inference, PointCLIP provides a favorable performance boost over state-of-the-art 3D networks. Therefore, PointCLIP is a promising alternative for effective 3D point cloud understanding in low-data regimes with marginal resource cost. We conduct thorough experiments on ModelNet10, ModelNet40, and ScanObjectNN to demonstrate the effectiveness of PointCLIP. Code is available at https://github.com/ZrrSkywalker/PointCLIP.
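
To make the projection-and-aggregation idea concrete, below is a minimal sketch of a PointCLIP-style zero-shot classifier. It assumes the OpenAI `clip` package (installable from https://github.com/openai/CLIP); the orthographic projection scheme, the view angles, the prompt template, and the helper names `depth_map` and `zero_shot_logits` are illustrative assumptions, not the exact settings or code of the paper or its repository.

```python
# Sketch only: project a point cloud to multi-view depth maps, encode each
# view with CLIP's image encoder, and average the per-view similarities to
# text features of the category prompts. Assumes pip-installed OpenAI CLIP.
import numpy as np
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def depth_map(points, angle, size=224):
    """Orthographically project a point cloud (N, 3) onto one view plane."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]], dtype=np.float32)
    pts = points.astype(np.float32) @ rot.T
    # Normalize x, y into pixel coordinates; use z as the depth value.
    xy = pts[:, :2]
    xy = (xy - xy.min(0)) / (xy.max(0) - xy.min(0) + 1e-6)
    uv = np.clip((xy * (size - 1)).astype(int), 0, size - 1)
    z = pts[:, 2] - pts[:, 2].min()
    depth = np.zeros((size, size), dtype=np.float32)
    np.maximum.at(depth, (uv[:, 1], uv[:, 0]), z)  # keep max depth per pixel
    return depth / (depth.max() + 1e-6)

@torch.no_grad()
def zero_shot_logits(points, class_names, num_views=6):
    # Text features for hypothetical "point cloud depth map of a <class>" prompts.
    tokens = clip.tokenize(
        [f"point cloud depth map of a {c}" for c in class_names]).to(device)
    text_f = model.encode_text(tokens)
    text_f = text_f / text_f.norm(dim=-1, keepdim=True)

    logits = 0
    for k in range(num_views):
        d = depth_map(points, angle=2 * np.pi * k / num_views)
        # Repeat the single depth channel to 3 channels for the image encoder
        # (CLIP's usual normalization is skipped here for brevity).
        img = torch.from_numpy(d).repeat(3, 1, 1).unsqueeze(0).to(device)
        img_f = model.encode_image(img)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        logits = logits + 100.0 * img_f @ text_f.T  # accumulate view-wise scores
    return logits / num_views  # simple uniform aggregation over views
```

In this sketch the views are fused by a plain average; the paper's inter-view adapter would instead learn to weight and fuse the view features before matching them to the text features, which is where the few-shot gains come from.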