Abstract
In video-based violence detection, the motion information in a video is vital. How to highlight this motion information and integrate spatiotemporal information is a pressing problem. In this paper, we propose a deep learning architecture that fuses shallow features into deep features to strengthen the network's ability to represent motion information in its deeper layers. To increase the weight of motion information in the network, we design a downsampling module that extracts shallow features, which are then fused with the deep features extracted by MobileNet blocks. Furthermore, we construct a channel attention module and introduce a Convolutional Long Short-Term Memory (ConvLSTM) module. These two modules redistribute the network's attention: the channel attention module focuses on channel-level information, and the ConvLSTM module emphasizes temporal information. Finally, we apply 3D convolution and global pooling to compress the features, which are then fed into fully connected layers to perform violence detection. Experiments on three publicly available benchmark datasets yield accuracies of 91% on the surveillance video dataset RWF2000, 97.5% on the Hockey Fight dataset, and 100% on the Movies dataset. Overall, the proposed model performs well on violence detection.
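The abstract does not specify the internal design of the channel attention module. As a rough illustration of what channel-level reweighting can look like, the following is a minimal sketch in the squeeze-and-excitation style, written in plain Python. The function name `channel_attention`, the weight matrices `w1` and `w2`, and the toy input are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def channel_attention(feature_map, w1, w2):
    """Squeeze-and-excitation style channel reweighting (illustrative only).

    feature_map: list of C channels, each an H x W list of lists.
    w1: R x C weight matrix (hypothetical reduction layer).
    w2: C x R weight matrix (hypothetical expansion layer).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: a bottleneck MLP (ReLU, then sigmoid) yields one gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Rescale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates

# Toy input: two 2x2 channels (assumed values, not data from the paper).
features = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0, mean 1.0
    [[0.0, 0.0], [0.0, 0.0]],  # channel 1, mean 0.0
]
w1 = [[1.0, 0.0]]     # reduce 2 channels to 1 hidden unit
w2 = [[2.0], [-2.0]]  # expand back to 2 channel gates
scaled, gates = channel_attention(features, w1, w2)
```

With these assumed weights, the first channel receives a larger gate than the second, so its activations are preserved more strongly. In a trained network the weights would be learned, and the module described in the paper may differ from this sketch.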