Abstract
Human action recognition (HAR), the task of assigning labels to actions performed by an individual or a group of people in a video or even a single image, still faces significant challenges. Deep neural networks, particularly transformer-based models, have made remarkable progress in this field in recent years, as their self-attention mechanisms allow them to focus on relevant parts of the input and capture long-range dependencies in sequences. However, transformers rely on large-scale data and are computationally expensive, which makes them ill-suited to real-time applications with limited resources. Training on a limited number of frames risks losing critical information, and selecting informative and diverse keyframes remains a challenge.
To address these issues, this thesis proposes a novel method, called Hybrid Embedding, for video clip embedding. By combining the advantages of existing embedding techniques, the method improves action recognition when only a limited number of frames is available. The frame arrangement compensates for the loss of temporal information while optimizing spatial feature extraction. Leveraging a transformer-based architecture, the proposed method effectively captures spatiotemporal information from only a few frames. In addition, a keyframe extraction method is introduced that uses the transformer model to select more informative and diverse frames, which is particularly important when working with few frames. A comprehensive evaluation framework is presented to assess the impact of the proposed method on HAR. Experiments include comparisons with conventional video embedding methods, performance analysis using RGB and skeletal data as well as their fusion, evaluation with varying numbers of frames, testing with different transformer architectures, benchmarking against state-of-the-art action recognition methods, and examining the effects of pretrained models and training strategies. The computational efficiency and complexity of the proposed method are also compared with those of state-of-the-art approaches.
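To illustrate the general idea, the following minimal PyTorch sketch shows how a small set of sampled frames could be embedded jointly as one token sequence and classified with a transformer encoder. It is not the thesis's exact Hybrid Embedding: the class name, dimensions, ViT-style patch projection, and positional scheme are illustrative assumptions.

```python
# Minimal sketch (illustrative only): embed a few sampled frames as a single
# spatiotemporal token sequence and classify the clip with a transformer
# encoder. Names, dimensions, and the frame arrangement are assumptions.
import torch
import torch.nn as nn

class FewFrameActionClassifier(nn.Module):
    def __init__(self, num_classes, num_frames=4, patch=16, dim=256, img=224):
        super().__init__()
        # Split each frame into non-overlapping patches and project every
        # patch to a token (a common ViT-style spatial embedding).
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        tokens_per_frame = (img // patch) ** 2
        # Learned positional embeddings over all frames so the arrangement
        # partially preserves temporal order despite the few frames used.
        self.pos = nn.Parameter(torch.zeros(1, num_frames * tokens_per_frame, dim))
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip):                        # clip: (B, T, 3, H, W)
        B = clip.size(0)
        x = self.to_tokens(clip.flatten(0, 1))      # (B*T, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)            # (B*T, tokens, dim)
        x = x.reshape(B, -1, x.size(-1)) + self.pos # joint spatiotemporal tokens
        x = torch.cat([self.cls.expand(B, -1, -1), x], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])                   # classify from the CLS token

# Usage: four frames per clip, e.g. chosen beforehand by a keyframe selector.
model = FewFrameActionClassifier(num_classes=60)
logits = model(torch.randn(2, 4, 3, 224, 224))      # shape (2, 60)
```

In a setup like this, the attention weights of the encoder could in principle be reused to score candidate frames for informativeness, which is one plausible way to realize the keyframe selection described above; the thesis's actual selection criterion may differ.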
Experimental results show that the proposed method achieves accuracies of 95.42% and 96.65% on the NTU-60 dataset and 91.70% and 80.91% on the NTU-120 dataset. The method effectively handles challenges such as variations in individuals, appearance, viewpoint, and background, demonstrating the capability of transformer-based architectures to process multimodal data from a limited number of frames.