Video Anomaly Detection (VAD) poses a critical challenge in intelligent surveillance, where the objective is to differentiate rare, ill-defined abnormal events from frequent normal activities, often without explicit anomaly labels. A significant hurdle lies in widening the discriminative gap between normal and abnormal occurrences, especially when the boundary between them is ambiguous or the anomalies are subtle.
A novel approach addresses this by introducing a Bidirectional Skip-frame Prediction (BiSP) network, built upon a dual-stream autoencoder architecture. The core innovation lies in leveraging “intra-domain disparity” between features to enhance anomaly detection.
Key Innovations:
- Bidirectional Skip-frame Prediction: Unlike conventional methods that predict a single next frame or reconstruct the same input, BiSP employs a unique prediction strategy.
  - Training Phase: The model is trained on “skip frames”: to predict a frame I_n, it uses non-adjacent past frames I_{n-k}, I_{n-m} (backward skip) and non-adjacent future frames I_{n+p}, I_{n+q} (forward skip). Forcing the model to learn relationships over these larger temporal gaps makes it easier to extract robust motion features.
  - Testing Phase: The two streams simultaneously co-predict the same intermediate frame from consecutive frames at both ends of a sequence; for instance, frames I_a and I_b jointly predict an intermediate frame I_c. If I_c represents a normal event, both predictions should be accurate and consistent. If I_c is anomalous, the prediction errors of the forward and backward streams become large and inconsistent, enlarging the “intra-domain disparity” and making anomalies more discernible. An SSIM (Structural Similarity Index Measure) consistency loss between the two predicted intermediate frames further enforces this behavior. A minimal sketch of this test-time co-prediction follows this list.
- Disparity-Driven Attention Mechanisms: To further maximize the distinction between normal and anomalous events during feature extraction and processing, BiSP integrates two specialized attention modules, both sketched after this list:
  - Variance Channel Attention (VarCA): This parallel attention mechanism focuses on “movement patterns.” It analyzes the variance across spatial dimensions within each feature channel, allowing the model to better discriminate based on dynamic changes and motion characteristics.
  - Context Spatial Attention (ConSA): Operating in a serial manner, this module enriches feature representation by focusing on “object scales.” It employs dilated convolutions with varying kernel sizes and dilation rates to capture multi-scale spatial context, ensuring that objects of different sizes are effectively processed. This helps in identifying anomalies regardless of the scale at which they appear.
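To make the prediction strategy concrete, here is a minimal sketch of skip-frame context selection and test-time co-prediction. It is an illustration rather than the authors' implementation: the helper names (skip_frame_context, co_predict_disparity), the stubbed forward_stream/backward_stream predictors, the gap sizes, and the use of per-frame MSE as the error measure are all assumptions.

```python
import torch
import torch.nn.functional as F

def skip_frame_context(frames, n, gaps=(2, 4)):
    """Training-time "skip frames": pick non-adjacent context frames
    around index n (the gap sizes here are illustrative assumptions)."""
    past = torch.stack([frames[n - g] for g in gaps])    # backward skip
    future = torch.stack([frames[n + g] for g in gaps])  # forward skip
    return past, future

def co_predict_disparity(forward_stream, backward_stream, frames, mid):
    """Test-time co-prediction of the intermediate frame frames[mid]
    from both temporal directions.

    forward_stream / backward_stream: callables that map a stack of
    context frames to one predicted frame (stand-ins for the two
    streams of the dual-stream autoencoder).
    frames: tensor of shape (T, C, H, W) holding a short test clip.
    """
    target = frames[mid]
    pred_fwd = forward_stream(frames[:mid])       # predict I_mid from the past
    pred_bwd = backward_stream(frames[mid + 1:])  # predict I_mid from the future

    # Per-stream prediction errors: both stay low for normal events.
    err_fwd = F.mse_loss(pred_fwd, target)
    err_bwd = F.mse_loss(pred_bwd, target)

    # Disagreement between the two predictions: it grows for anomalous
    # frames, which is the "intra-domain disparity" the method exploits.
    disparity = F.mse_loss(pred_fwd, pred_bwd)
    return err_fwd, err_bwd, disparity
```

During training, an SSIM-based consistency term between the two predicted intermediate frames would additionally push the streams to agree on normal data; at test time, large per-stream errors combined with a large disparity flag the frame as anomalous.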
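The two attention modules can be sketched in the same spirit. Only the core ideas from the descriptions above are kept: channel weights derived from per-channel spatial variance (VarCA) and a spatial map fused from multi-scale dilated context (ConSA). The reduction ratio, kernel size, and dilation rates are illustrative assumptions, and the sketch shows only the attention computation itself, not how the modules are placed (parallel vs. serial) inside the network.

```python
import torch
import torch.nn as nn

class VarianceChannelAttention(nn.Module):
    """Stand-in for VarCA: channel weights driven by per-channel spatial
    variance, which reflects how strongly a channel responds to motion."""
    def __init__(self, channels, reduction=8):  # reduction ratio is assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        var = x.flatten(2).var(dim=2)            # variance over H*W -> (B, C)
        weights = self.mlp(var).unsqueeze(-1).unsqueeze(-1)
        return x * weights                       # re-weight channels

class ContextSpatialAttention(nn.Module):
    """Stand-in for ConSA: a spatial attention map fused from parallel
    dilated convolutions so that objects of different scales contribute."""
    def __init__(self, channels, dilations=(1, 2, 4)):  # rates are assumed
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(len(dilations) * channels, 1, kernel_size=1)

    def forward(self, x):                        # x: (B, C, H, W)
        ctx = torch.cat([branch(x) for branch in self.branches], dim=1)
        attn = torch.sigmoid(self.fuse(ctx))     # (B, 1, H, W) spatial map
        return x * attn
```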
Main Prior Ingredients:
The proposed BiSP network builds upon several established computer vision and deep learning techniques:
- Autoencoders (AE): The foundational unsupervised learning architecture for VAD, where an encoder compresses input features and a decoder reconstructs or predicts outputs. The model learns to encode normal patterns effectively.
- Prediction-Based Anomaly Detection: The core paradigm relies on the principle that anomalous events lead to higher prediction errors when a model, trained only on normal data, attempts to predict them.
- Dual-Frame / Bidirectional Prediction: Prior works have explored predicting frames from both forward and backward contexts. BiSP extends this by introducing the skip-frame training and co-prediction of a single intermediate frame in testing, explicitly for enhancing disparity.
- Attention Mechanisms: The general concept of attention, which allows models to selectively focus on relevant parts of the input, is a widely adopted technique in deep learning. BiSP innovates by designing specific attention types tailored to VAD challenges (variance for motion, context spatial for multi-scale objects).
- L2 Loss and SSIM: Standard reconstruction/prediction errors are measured using L2 distance. SSIM, a common image quality metric, is employed for the consistency loss, reflecting structural similarity between predicted frames.
- PSNR (Peak Signal-to-Noise Ratio): This widely used metric quantifies the quality of reconstructed or predicted images and is adapted here to compute anomaly scores; a minimal scoring sketch follows this list.
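To illustrate the last two ingredients, the following sketch computes a PSNR-based anomaly score and the SSIM agreement between the two stream predictions. It relies on scikit-image's standard metric implementations; the grayscale-in-[0, 1] convention and the min-max normalization of PSNR over a test video are common choices assumed for the example, not details taken from the description above.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(pred_fwd, pred_bwd, target):
    """PSNR of each stream's prediction against the ground-truth frame,
    plus the SSIM agreement between the two predictions (grayscale float
    images in [0, 1] are assumed)."""
    psnr_fwd = peak_signal_noise_ratio(target, pred_fwd, data_range=1.0)
    psnr_bwd = peak_signal_noise_ratio(target, pred_bwd, data_range=1.0)
    consistency = structural_similarity(pred_fwd, pred_bwd, data_range=1.0)
    return psnr_fwd, psnr_bwd, consistency

def anomaly_scores(psnr_per_frame):
    """Turn per-frame PSNR into [0, 1] anomaly scores: lower PSNR (worse
    prediction) maps to a higher score. Min-max normalization over the
    video is a common convention, assumed here."""
    p = np.asarray(psnr_per_frame, dtype=np.float64)
    normalized = (p - p.min()) / (p.max() - p.min() + 1e-8)
    return 1.0 - normalized
```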
In essence, this work presents a robust unsupervised VAD framework that significantly improves performance by strategically utilizing bidirectional skip-frame prediction combined with novel attention mechanisms to amplify the inherent distinctions between normal and abnormal video events.