CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
Existing task-specific predictive approaches for audio-language tasks focus on building complicated late-fusion mechanisms. However, these models tend to overfit when labels are limited and generalize poorly. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio …