ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition
Video Action Recognition (VAR) is a challenging task due to its inherent complexities. Although many approaches have been explored in the literature, designing a unified framework that recognizes a large number of human actions remains an open problem. Recently, Multi-Modal Learning (MML) has demonstrated promising results in this domain. …