Journal Articles
8 articles found.
1. Learning group interaction for sports video understanding from a perspective of athlete
Authors: Rui HE, Zehua FU, Qingjie LIU, Yunhong WANG, Xunxun CHEN
Frontiers of Computer Science (SCIE, EI, CSCD), 2024, Issue 4, pp. 175-188 (14 pages)
Abstract: Learning the interactions between small-group activities is a key step in understanding team sports videos. Recent research on team sports videos can be regarded as taking strictly the audience's perspective rather than the athlete's. For team sports videos such as volleyball and basketball, there are plenty of intra-team and inter-team relations. In this paper, a new task named Group Scene Graph Generation is introduced to better understand intra-team and inter-team relations in sports videos. To tackle this problem, a novel Hierarchical Relation Network is proposed. After all players in a video are finely divided into two teams, the features of the two teams' activities and interactions are enhanced by Graph Convolutional Networks and finally recognized to generate the Group Scene Graph. For evaluation, a Volleyball+ dataset is proposed, built on the Volleyball dataset with 9660 additional team activity labels. A baseline is set for better comparison, and our experimental results demonstrate the effectiveness of our method. Moreover, the idea of our method can be directly applied to another video-based task, Group Activity Recognition; experiments show the superiority of our method and reveal the link between the two tasks. Finally, from the athlete's view, we present an interpretation that shows how to utilize the Group Scene Graph to analyze team activities and provide professional gaming suggestions.
Keywords: group scene graph, group activity recognition, scene graph generation, graph convolutional network, sports video understanding
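Editor's note: as a rough illustration (not the authors' code) of the graph-convolution step described above, the sketch below enhances per-player features over a player relation graph. The feature dimensions, adjacency matrix, and single-layer design are placeholder assumptions.

```python
# Illustrative sketch only: a minimal graph convolution over per-player
# features, of the kind a relation network might use to enhance intra-team
# and inter-team relation features before recognition.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_players, in_dim) per-player appearance features
        # adj: (num_players, num_players) relation graph (e.g., intra-/inter-team
        #      edges), assumed to be row-normalized
        return torch.relu(self.linear(adj @ x))

players = torch.randn(12, 256)        # 12 players, 256-d features (assumed)
adj = torch.ones(12, 12) / 12         # dummy fully connected, normalized graph
enhanced = SimpleGCNLayer(256, 128)(players, adj)
print(enhanced.shape)                 # torch.Size([12, 128])
```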
2. An Efficient Temporal Decoding Module for Action Recognition
Authors: HUANG Qiubo, MEI Jianmin, ZHAO Wupeng, LU Yiru, WANG Mei, CHEN Dehua
Journal of Donghua University (English Edition), 2025, Issue 2, pp. 187-196 (10 pages)
Abstract: Action recognition, a fundamental task in the field of video understanding, has been extensively researched and applied. In contrast to an image, a video introduces an extra temporal dimension. However, many existing action recognition networks either perform simple temporal fusion through averaging or rely on pre-trained models from image recognition, resulting in limited temporal information extraction capabilities. This work proposes a highly efficient temporal decoding module that can be seamlessly integrated into any action recognition backbone network to enhance the focus on temporal relationships between video frames. Firstly, the decoder initializes a set of learnable queries, termed video-level action category prediction queries. Then, after self-attention learning, the queries are combined with the video frame features extracted by the backbone network to extract video context information. Finally, these prediction queries with rich temporal features are used for category prediction. Experimental results on the HMDB51, MSRDailyAct3D, Diving48, and Breakfast datasets show that introducing the proposed temporal decoder yields a significant improvement in Top-1 accuracy over the original TokShift-Transformer and VideoMAE models when they are used as encoders: an average performance increase exceeding 11% for TokShift-Transformer and nearly 5% for VideoMAE across the four datasets. Furthermore, the work explores combining the decoder with various action recognition networks, including Timesformer, as encoders, which results in an average accuracy improvement of more than 3.5% on the HMDB51 dataset. The code is available at https://github.com/huangturbo/TempDecoder.
Keywords: action recognition, video understanding, temporal relationship, temporal decoder, Transformer
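Editor's note: the following is a hedged sketch of the decoding idea described above (learnable category-prediction queries gathering temporal context from frame features), not the released code at the GitHub link. The single cross-attention layer, dimensions, and query count are assumptions.

```python
# Illustrative sketch only: learnable queries cross-attend over per-frame
# backbone features, and the resulting context is used for category prediction.
import torch
import torch.nn as nn

class TemporalDecoderSketch(nn.Module):
    def __init__(self, dim=512, num_queries=4, num_classes=51):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # learnable prediction queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):                   # frame_feats: (B, T, dim)
        b = frame_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        ctx, _ = self.cross_attn(q, frame_feats, frame_feats)  # queries gather temporal context
        return self.head(ctx.mean(dim=1))             # pooled queries -> class logits

logits = TemporalDecoderSketch()(torch.randn(2, 16, 512))
print(logits.shape)                                   # torch.Size([2, 51])
```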
3. Editorial: Special Section on Intelligent Network Video Advances Based on Transformers
Authors: Lin Yuanbo Wu, Bo Li, Huibing Wang, Chunhua Shen, Benjamin Mora, Chen Chen, Xianghua Xie
Big Data Mining and Analytics, 2025, Issue 3, p. 519 (1 page)
Abstract: Transformers, originally designed for natural language processing, have demonstrated remarkable capabilities in modeling long-range dependencies, capturing complex spatiotemporal patterns, and enhancing the interpretability of video-based AI systems. In recent years, the integration of transformer-based architectures has significantly advanced the field of intelligent network video analysis, reshaping traditional paradigms in video understanding, surveillance, and real-time processing.
Keywords: Transformers, video understanding, spatiotemporal patterns, intelligent network video analysis, surveillance, natural language processing, interpretability
4. SVMFN-FSAR: Semantic-Guided Video Multimodal Fusion Network for Few-Shot Action Recognition
Authors: Ran Wei, Rui Yan, Hongyu Qu, Xing Li, Qiaolin Ye, Liyong Fu
Big Data Mining and Analytics, 2025, Issue 3, pp. 534-550 (17 pages)
Abstract: Few-Shot Action Recognition (FSAR) has been a hot topic in various areas, such as computer vision and forest ecosystem security. FSAR aims to recognize previously unseen classes using limited labeled video examples. A principal challenge in the FSAR task is obtaining, from only a few samples, more action semantics related to the category for classification. Recent studies attempt to compensate for visual information through action labels. However, concise action category names lead to a less distinct semantic space and potential performance limitations. In this work, we propose a novel Semantic-guided Video Multimodal Fusion Network for FSAR (SVMFN-FSAR). We utilize a Large Language Model (LLM) to expand detailed textual knowledge of various action categories, enhancing the distinctiveness of the semantic space and alleviating, to some extent, the problem of insufficient samples in FSAR tasks. We perform a matching metric between the extracted distinctive semantic information and the visual information of unknown-class samples to understand the overall semantics of the video for preliminary classification. In addition, we design a novel semantic-guided temporal interaction module based on Transformers, which lets the LLM-expanded knowledge and visual information complement each other and improves the quality of feature representation in samples. Experimental results on three few-shot benchmarks, Kinetics, UCF101, and HMDB51, consistently demonstrate the effectiveness and interpretability of the proposed method.
Keywords: few-shot learning, action recognition, Large Language Model (LLM), Transformer, video understanding
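Editor's note: a minimal sketch (not SVMFN-FSAR itself) of the matching step described above, in which LLM-expanded textual class descriptions act as semantic prototypes matched against a query-video embedding by cosine similarity. The random embeddings stand in for real text and video encoder outputs.

```python
# Illustrative sketch only: cosine-similarity matching between per-class
# semantic prototypes and a pooled query-video embedding for preliminary
# few-shot classification.
import torch
import torch.nn.functional as F

num_classes, dim = 5, 512
text_protos = F.normalize(torch.randn(num_classes, dim), dim=-1)  # LLM-expanded class descriptions (placeholder)
query_video = F.normalize(torch.randn(1, dim), dim=-1)            # pooled query-video embedding (placeholder)

similarity = query_video @ text_protos.t()     # (1, num_classes) cosine scores
pred_class = similarity.argmax(dim=-1)
print(similarity, pred_class)
```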
5. Leveraging Federated Learning for Efficient Privacy-Enhancing Violent Activity Recognition from Videos
Authors: Moshiur Rahman Tonmoy, Md. Mithun Hossain, Mejdl Safran, Sultan Alfarhood, Dunren Che, M. F. Mridha
Computers, Materials & Continua, 2025, Issue 12, pp. 5747-5763 (17 pages)
Abstract: Automated recognition of violent activities from videos is vital for public safety, but it often raises significant privacy concerns due to the sensitive nature of the footage. Moreover, resource constraints often hinder the deployment of complex deep learning-based video classification models on edge devices. With this motivation, this study investigates an effective violent activity classifier that minimizes computational complexity, attains competitive performance, and mitigates user data privacy concerns. We present a lightweight deep learning architecture with fewer parameters for efficient violent activity recognition. We utilize a two-stream formation of 3D depthwise separable convolution coupled with a linear self-attention mechanism for effective feature extraction, incorporating federated learning to address data privacy concerns. Experimental findings demonstrate the model's effectiveness, with test accuracies from 96% to above 97% on multiple datasets when the FedProx aggregation strategy is incorporated. These findings underscore the potential to develop secure, efficient, and reliable solutions for violent activity recognition in real-world scenarios.
Keywords: violent activity recognition, human activity recognition, federated learning, video understanding, computer vision
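Editor's note: the sketch below illustrates, under placeholder assumptions, the two FedProx ingredients mentioned in the abstract, a size-weighted server aggregation of client weights and a proximal term added to each client's local loss. It is not the paper's implementation; the model, client sizes, and mu are made up.

```python
# Illustrative sketch only: FedProx-style federated training pieces.
import copy
import torch
import torch.nn as nn

def aggregate(client_states, client_sizes):
    """Server step: size-weighted average of client state_dicts (as in FedAvg/FedProx)."""
    total = sum(client_sizes)
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(s[key] * (n / total) for s, n in zip(client_states, client_sizes))
    return avg

def proximal_term(local_model, global_model, mu=0.01):
    """FedProx regularizer (mu/2)*||w_local - w_global||^2, added to the local training loss."""
    return (mu / 2) * sum((lp - gp).pow(2).sum()
                          for lp, gp in zip(local_model.parameters(), global_model.parameters()))

global_model = nn.Linear(8, 2)                               # placeholder for the video classifier
clients = [copy.deepcopy(global_model) for _ in range(3)]    # pretend locally trained copies
new_state = aggregate([c.state_dict() for c in clients], client_sizes=[100, 60, 40])
global_model.load_state_dict(new_state)
print(proximal_term(clients[0], global_model))               # tensor(0.) here, since no local updates were made
```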
6. Dance2MIDI: Dance-driven multi-instrument music generation
Authors: Bo Han, Yuheng Li, Yixuan Shen, Yi Ren, Feilin Han
Computational Visual Media (SCIE, EI, CSCD), 2024, Issue 4, pp. 791-802 (12 pages)
Abstract: Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw audio generation, while the multi-instrument scenario is under-explored. The challenges associated with dance-driven multi-instrument music (MIDI) generation are twofold: (i) the lack of a publicly available paired multi-instrument MIDI and video dataset and (ii) the weak correlation between music and video. To tackle these challenges, we have built the first multi-instrument MIDI and dance paired dataset (D2MIDI). Based on this dataset, we introduce a multi-instrument MIDI generation framework (Dance2MIDI) conditioned on dance video. Firstly, to capture the relationship between dance and music, we employ a graph convolutional network to encode the dance motion, which allows us to extract features related to dance movement and dance style. Secondly, to generate a harmonious rhythm, we utilize a transformer model with a cross-attention mechanism to decode the drum track sequence. Thirdly, we model the generation of the remaining tracks, conditioned on the drum track, as a sequence understanding and completion task; a BERT-like model is employed to comprehend the context of the entire music piece through self-supervised learning. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method achieves state-of-the-art performance.
Keywords: video understanding, music generation, symbolic music, cross-modal learning, self-supervision
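Editor's note: a hedged sketch (not Dance2MIDI itself) of the drum-track decoding step described above: a Transformer decoder cross-attends dance-motion features while producing drum-token logits under a causal mask. The vocabulary size, dimensions, and sequence lengths are placeholder assumptions.

```python
# Illustrative sketch only: cross-attention from drum-track tokens to
# (e.g., GCN-encoded) dance-motion features.
import torch
import torch.nn as nn

dim, drum_vocab = 256, 128
decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
to_logits = nn.Linear(dim, drum_vocab)

motion_feats = torch.randn(1, 64, dim)   # encoded dance motion, 64 time steps (placeholder)
drum_tokens = torch.randn(1, 32, dim)    # embedded drum tokens generated so far (placeholder)
causal_mask = torch.triu(torch.full((32, 32), float("-inf")), diagonal=1)  # no peeking ahead

hidden = decoder(tgt=drum_tokens, memory=motion_feats, tgt_mask=causal_mask)
print(to_logits(hidden).shape)           # torch.Size([1, 32, 128]) next-token logits
```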
7. UniCount: Mining Large-Scale Video Data for Universal Repetitive Action Counting
Authors: Yin Tang, Deyu Zhang, Wei Luo, Fan Wu, Feng Lyu, Ruixiang Hang, Lei Zhang, Yaoxue Zhang
Big Data Mining and Analytics, 2025, Issue 5, pp. 1112-1126 (15 pages)
Abstract: We introduce the Open Sequential Repetitive Action Counting (OSRAC) task, which aims to count all repetitions and locate transition boundaries of sequential actions in large-scale video data, without relying on predefined action categories. Unlike the Repetitive Action Counting (RAC) task, which rests on a single-action assumption, OSRAC handles diverse and alternating repetitive action sequences in real-world scenarios, which is fundamentally more challenging. To this end, we propose UniCount, a universal system capable of counting multiple sequential repetitive actions from video data. Specifically, UniCount comprises three primary modules: the Universal Repetitive Pattern Learner (URPL), which captures general repetitive patterns in alternating actions; the Temporal Action Boundary Discriminator (TABD), which locates action transition boundaries; and the Dual Density Map Estimator (DDME), which performs action counting and repetition segmentation. We also design a novel actionness loss to improve the detection of action transitions. To support this task, we conduct in-depth analysis of existing RAC datasets and construct several OSRAC benchmarks (i.e., MUCFRep, MRepCount, and MInfiniteRep) by developing a data processing and mining pipeline. We further perform comprehensive experiments to evaluate the effectiveness of UniCount. On MInfiniteRep, UniCount substantially improves the Off-By-One Accuracy (OBOA) from 0.39 to 0.78 and decreases the Mean Absolute Error (MAE) from 0.29 to 0.14 compared to counterparts. UniCount also achieves superior performance on open-set data, showcasing its universality.
Keywords: data analysis and processing, video understanding, action counting
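Editor's note: a minimal sketch (not UniCount itself) of density-map counting and the two metrics quoted above. Counting by summing a per-frame repetition density map is standard in RAC work; the density values, ground truth, and the MAE normalization below are placeholder assumptions and may differ from the paper's exact definitions.

```python
# Illustrative sketch only: count repetitions from a per-frame density map
# and compute Off-By-One Accuracy (OBOA) and a normalized MAE for one video.
import torch

pred_density = torch.tensor([0.0, 0.2, 0.8, 0.1, 0.9, 0.0, 1.0])  # placeholder per-frame densities
pred_count = pred_density.sum()                                    # density map sums to the count (~3 here)
gt_count = torch.tensor(3.0)                                       # placeholder ground-truth count

oboa = (torch.abs(pred_count.round() - gt_count) <= 1).float()     # 1.0 if within one repetition
mae = torch.abs(pred_count - gt_count) / gt_count.clamp(min=1)     # count error normalized by ground truth
print(pred_count.item(), oboa.item(), mae.item())
```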
8. Deep Learning-based Moving Object Segmentation: Recent Progress and Research Prospects (cited 2 times)
Authors: Rui Jiang, Ruixiang Zhu, Hu Su, Yinlin Li, Yuan Xie, Wei Zou
Machine Intelligence Research (EI, CSCD), 2023, Issue 3, pp. 335-369 (35 pages)
Abstract: Moving object segmentation (MOS), which aims to segment moving objects from video frames, is an important and challenging task in computer vision with various applications. With the development of deep learning (DL), MOS has also entered the era of deep models for spatiotemporal feature learning. This paper provides an up-to-date review of DL-based MOS methods proposed during the past three years. Specifically, we present a more up-to-date categorization based on model characteristics, then compare and discuss each category from the perspectives of feature learning (FL) and of model training and evaluation. For FL, the reviewed methods are divided into three types: spatial FL, temporal FL, and spatiotemporal FL; they are then analyzed from the aspects of input and model architecture, and three input types and four typical preprocessing subnetworks are summarized. In terms of training, we discuss ideas for enhancing model transferability. In terms of evaluation, based on a previous categorization into scene-dependent and scene-independent evaluation, combined with whether the videos are recorded with static or moving cameras, we further provide four subdivided evaluation setups and analyze those of the reviewed methods. We also show performance comparisons of some reviewed MOS methods and analyze their technical advantages and disadvantages. Finally, based on the above comparisons and discussions, we present research prospects and future directions.
Keywords: moving object segmentation (MOS), change detection, background subtraction, deep learning (DL), video understanding