CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Existing task-specific predictive approaches for audio-and-language tasks focus on building complicated late-fusion mechanisms. However, these models tend to overfit when labels are limited and generalize poorly. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio …