Abstract
To address the low accuracy of 3D convolution in human behavior recognition, caused by insufficient extraction of spatio-temporal information from consecutive video frames and inadequate attention to cross-channel interaction information, a behavior recognition method based on the R(2+1)D network with multi-branch spatio-temporal information fusion and attention is proposed. Video frames were extracted and augmented. With the R(2+1)D network as the basic framework and incorporating the Inception design, the input frames were convolved along multiple spatio-temporal branches and the resulting features were fused; ECA channel attention then screened the fused features for cross-channel interaction information to extract more abstract high-level features. Finally, classification was performed and the human behavior recognition results were output. The method makes full use of the video's spatio-temporal features and cross-channel interaction information, achieving an accuracy of 94.71% on the UCF101 dataset, 4.53 percentage points higher than the baseline R(2+1)D network, while reducing the model parameters from 33.3×10⁶ to 26.9×10⁶. Experiments show that the method effectively improves the accuracy of human behavior recognition.
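The reported parameter saving comes from the (2+1)D factorization underlying the R(2+1)D backbone. A minimal sketch of the parameter accounting, assuming the standard R(2+1)D design of Tran et al. (the layer sizes below are illustrative and not taken from this paper):

```python
# Hypothetical sketch (not this paper's code): how a (2+1)D factorization
# replaces a full t x d x d 3D convolution. The intermediate channel count M
# is chosen so the factorized block roughly matches the 3D parameter budget.
import math

def conv3d_params(c_in, c_out, t=3, d=3):
    """Parameters of a full t x d x d 3D convolution (bias ignored)."""
    return c_in * c_out * t * d * d

def r2plus1d_params(c_in, c_out, t=3, d=3):
    """Parameters of the (2+1)D factorization: a 1 x d x d spatial
    convolution into M intermediate channels, followed by a t x 1 x 1
    temporal convolution."""
    m = math.floor(t * d * d * c_in * c_out / (d * d * c_in + t * c_out))
    spatial = c_in * m * d * d   # 1 x d x d spatial filters
    temporal = m * c_out * t     # t x 1 x 1 temporal filters
    return spatial + temporal, m

full = conv3d_params(64, 128)        # 221184 parameters
fact, m = r2plus1d_params(64, 128)   # ~220800 parameters, M = 230
print(full, fact, m)
```

At matched parameter count, the factorization doubles the number of nonlinearities per block and lets the spatial and temporal kernels be optimized separately, which is the usual motivation for R(2+1)D over plain 3D convolution.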
Authors
Li Linyu; Chen Shurong (College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China)
Source
Computer Applications and Software (《计算机应用与软件》), a Peking University core journal
2026, No. 2, pp. 248-254 (7 pages)
Keywords
R(2+1)D
Spatio-temporal convolution
Feature fusion
Efficient channel attention
Cross-channel interaction
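For readers unfamiliar with the efficient channel attention (ECA) named in the keywords, a minimal numpy sketch, assuming the standard ECA-Net formulation (adaptive kernel rule with gamma=2, b=1); the tensor sizes are illustrative and not taken from this paper:

```python
# Hypothetical numpy sketch of ECA (Efficient Channel Attention): global
# average pooling, a shared 1D convolution across channels, and a sigmoid
# gate that reweights each channel of the feature map.
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size k derived from the channel count."""
    t = int(abs((np.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1  # force k odd

def eca_attention(x, weights):
    """x: (C, T, H, W) feature map; weights: (k,) shared 1D conv kernel.
    Returns the channel-reweighted feature map."""
    c = x.shape[0]
    gap = x.reshape(c, -1).mean(axis=1)          # global average pooling -> (C,)
    k = len(weights)
    padded = np.pad(gap, k // 2)                 # same-length 1D conv over channels
    conv = np.array([padded[i:i + k] @ weights for i in range(c)])
    scale = 1.0 / (1.0 + np.exp(-conv))          # sigmoid gate per channel
    return x * scale[:, None, None, None]

x = np.random.rand(64, 4, 7, 7)                  # C=64 channels
k = eca_kernel_size(64)                          # 64 channels -> k = 3
out = eca_attention(x, np.ones(k) / k)
print(out.shape)                                 # -> (64, 4, 7, 7)
```

Unlike SE-style attention, ECA avoids dimensionality-reducing fully connected layers: the single k-tap 1D convolution captures local cross-channel interaction at negligible parameter cost, which matches the abstract's emphasis on cross-channel interaction information.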