ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition

Video Action Recognition (VAR) is a challenging task due to its inherent complexities. Although different approaches have been explored in the literature, designing a unified framework that recognizes a large number of human actions remains difficult. Recently, Multi-Modal Learning (MML) has demonstrated promising results in this domain. …