Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset
Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate …