Understanding Multimodal Deep Neural Networks: A Concept Selection View
Understanding Multimodal Deep Neural Networks: A Concept Selection View
The multimodal deep neural networks, represented by CLIP, have generated rich downstream applications owing to their excellent performance, thus making understanding the decision-making process of CLIP an essential research topic. Due to the complex structure and the massive pre-training data, it is often regarded as a black-box model that is …