Funding: National Natural Science Foundation of China (Grant No. U20B2069); Fundamental Research Funds for the Central Universities.
Abstract: Learning the activities and interactions of small groups is a key step toward understanding team sports videos. Recent research on team sports videos has been conducted largely from the perspective of the audience rather than that of the athlete. Team sports videos, such as volleyball and basketball videos, contain plenty of intra-team and inter-team relations. In this paper, a new task named Group Scene Graph Generation is introduced to better understand intra-team and inter-team relations in sports videos. To tackle this problem, a novel Hierarchical Relation Network is proposed. After all players in a video are divided into two teams, the features of the two teams' activities and interactions are enhanced by Graph Convolutional Networks and finally classified to generate the Group Scene Graph. For evaluation, a Volleyball+ dataset is constructed by extending the Volleyball dataset with 9660 additional team activity labels. A baseline is provided for comparison, and experimental results demonstrate the effectiveness of the proposed method. Moreover, the idea behind the method can be directly applied to another video-based task, Group Activity Recognition; experiments show the superiority of the method and reveal the link between the two tasks. Finally, from the athlete's view, we present an interpretation that shows how the Group Scene Graph can be used to analyze team activities and provide professional game-play suggestions.
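As an illustration of the relation-enhancement step described above, the following is a minimal PyTorch sketch of a graph-convolution layer that refines per-player features over a team graph before the relations are classified; the class name, feature dimensions, and adjacency construction are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGCN(nn.Module):
    """One graph-convolution layer that refines per-player features
    over a row-normalized adjacency built from team membership."""
    def __init__(self, in_dim=1024, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x:   (N, in_dim)  per-player appearance features from a backbone
        # adj: (N, N)       normalized adjacency over intra-/inter-team edges
        return F.relu(self.proj(adj @ x))

# Toy usage: 12 players split into two teams of 6.
feats = torch.randn(12, 1024)
team = torch.tensor([0] * 6 + [1] * 6)
adj = (team[:, None] == team[None, :]).float()   # intra-team edges only
adj = adj / adj.sum(dim=-1, keepdim=True)        # row-normalize
refined = RelationGCN()(feats, adj)              # -> (12, 512)
```

In practice the adjacency could also carry inter-team edges with separate learned weights, which is where the intra-team/inter-team distinction of the task would enter.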
Funding: Shanghai Municipal Commission of Economy and Information Technology, China (No. 202301054).
Abstract: Action recognition, a fundamental task in the field of video understanding, has been extensively researched and applied. In contrast to an image, a video introduces an extra temporal dimension. However, many existing action recognition networks either perform simple temporal fusion through averaging or rely on models pre-trained for image recognition, resulting in limited temporal information extraction capabilities. This work proposes a highly efficient temporal decoding module that can be seamlessly integrated into any action recognition backbone network to strengthen the modeling of temporal relationships between video frames. Firstly, the decoder initializes a set of learnable queries, termed video-level action category prediction queries. Then, after self-attention learning, these queries are combined with the video frame features extracted by the backbone network to capture video context information. Finally, the prediction queries, now enriched with temporal features, are used for category prediction. Experimental results on the HMDB51, MSRDailyAct3D, Diving48, and Breakfast datasets show that introducing the proposed temporal decoder with TokShift-Transformer and VideoMAE as encoders yields a significant improvement in Top-1 accuracy over the original models: the average gain across the four datasets exceeds 11% for TokShift-Transformer and is nearly 5% for VideoMAE. Furthermore, the work explores combining the decoder with various other action recognition networks, including Timesformer, as encoders, which yields an average accuracy improvement of more than 3.5% on the HMDB51 dataset. The code is available at https://github.com/huangturbo/TempDecoder.
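As a rough sketch of the query-based temporal decoding idea, the PyTorch module below attaches learnable prediction queries to per-frame encoder features via a standard transformer decoder (self-attention among queries plus cross-attention to frames). The layer count, dimensions, number of queries, and pooling choice are assumptions for illustration, not the released TempDecoder code.

```python
import torch
import torch.nn as nn

class TemporalDecoder(nn.Module):
    """Learnable action-category prediction queries that cross-attend to the
    per-frame features produced by an action-recognition backbone."""
    def __init__(self, num_queries=4, dim=768, num_classes=51, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (B, T, dim) frame features from the encoder
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        q = self.decoder(tgt=q, memory=frame_feats)   # query self-attn + cross-attn to frames
        return self.head(q.mean(dim=1))               # (B, num_classes)

logits = TemporalDecoder()(torch.randn(2, 16, 768))   # e.g. two 16-frame clips
```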
Abstract: Transformers, originally designed for natural language processing, have demonstrated remarkable capabilities in modeling long-range dependencies, capturing complex spatiotemporal patterns, and enhancing the interpretability of video-based AI systems. In recent years, the integration of transformer-based architectures has significantly advanced the field of intelligent network video analysis, reshaping traditional paradigms in video understanding, surveillance, and real-time processing.
Funding: Supported by the National Key Research and Development Program (No. 2022YFD2201005), the National Natural Science Foundation of China (Nos. 62072246 and 32371877), the Postdoctoral Fellowship Program of CPSF (No. GZB20230302), and the Construction of the Forest Fire Prevention Comprehensive System in Chongli District, Zhangjiakou City (Unmanned Aerial Vehicle Patrol Monitoring System) (No. DA2020001).
Abstract: Few-Shot Action Recognition (FSAR) has been a hot topic in various areas, such as computer vision and forest ecosystem security. FSAR aims to recognize previously unseen classes using a limited number of labeled video examples. A principal challenge in FSAR is to obtain, from only a few samples, more of the action semantics related to each category for classification. Recent studies attempt to compensate for visual information through action labels; however, concise action category names lead to a less distinct semantic space and potential performance limitations. In this work, we propose a novel Semantic-guided Video Multimodal Fusion Network for FSAR (SVMFN-FSAR). We utilize a Large Language Model (LLM) to expand detailed textual knowledge of the various action categories, enhancing the distinctiveness of the semantic space and alleviating, to some extent, the problem of insufficient samples in FSAR. We compute a matching metric between the extracted distinctive semantic information and the visual information of unknown-class samples to understand the overall semantics of a video for preliminary classification. In addition, we design a novel semantic-guided temporal interaction module based on Transformers, which lets the LLM-expanded knowledge and the visual information complement each other and improves the quality of feature representation in the samples. Experimental results on three few-shot benchmarks, Kinetics, UCF101, and HMDB51, consistently demonstrate the effectiveness and interpretability of the proposed method.
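A minimal sketch of the matching step, assuming cosine similarity between a pooled video embedding and per-class text embeddings derived from LLM-expanded descriptions; the function name and temperature are illustrative and not the SVMFN-FSAR implementation.

```python
import torch
import torch.nn.functional as F

def semantic_matching_logits(video_feat, class_text_feats, tau=0.07):
    """Cosine-similarity matching between pooled video embeddings and
    per-class text embeddings (e.g., LLM-expanded category descriptions).
    video_feat:       (B, D) query-video embeddings
    class_text_feats: (C, D) one embedding per candidate action class
    Returns (B, C) logits for preliminary few-shot classification."""
    v = F.normalize(video_feat, dim=-1)
    t = F.normalize(class_text_feats, dim=-1)
    return v @ t.t() / tau

# Toy 5-way episode with 4 query videos and 512-d embeddings.
logits = semantic_matching_logits(torch.randn(4, 512), torch.randn(5, 512))
probs = logits.softmax(dim=-1)
```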
Funding: Supported by the Research Chair of Online Dialogue and Cultural Communication, King Saud University, Saudi Arabia.
Abstract: Automated recognition of violent activities from videos is vital for public safety but often raises significant privacy concerns due to the sensitive nature of the footage. Moreover, resource constraints often hinder the deployment of complex deep learning-based video classification models on edge devices. With this motivation, this study investigates an effective violent activity classifier that minimizes computational complexity, attains competitive performance, and mitigates user data privacy concerns. We present a lightweight deep learning architecture with fewer parameters for efficient violent activity recognition. It uses a two-stream design of 3D depthwise separable convolutions coupled with a linear self-attention mechanism for effective feature extraction, and it incorporates federated learning to address data privacy concerns. Experimental findings demonstrate the model's effectiveness, with test accuracies ranging from 96% to above 97% on multiple datasets when the FedProx aggregation strategy is incorporated. These findings underscore the potential for developing secure, efficient, and reliable solutions for violent activity recognition in real-world scenarios.
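A minimal sketch of the lightweight building block named above, a 3D depthwise separable convolution, is shown below; the channel sizes and the norm/activation choices are assumptions rather than the paper's exact configuration. (On the federated side, FedProx differs from plain FedAvg by adding a proximal term of the form (mu/2)*||w - w_global||^2 to each client's local objective, which keeps client updates close to the global model under heterogeneous data.)

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    """Factorizes a standard 3D convolution into a per-channel (depthwise)
    convolution followed by a 1x1x1 pointwise convolution, which sharply
    reduces the parameter count for lightweight video models."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                  # x: (B, C, T, H, W)
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

clip = torch.randn(1, 3, 16, 112, 112)              # one short RGB clip
out = DepthwiseSeparableConv3d(3, 32)(clip)          # -> (1, 32, 16, 112, 112)
```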
Funding: Supported by the National Social Science Foundation Art Project (No. 20BC040) and a China Scholarship Council (CSC) Grant (No. 202306320525).
Abstract: Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw audio generation, while the multi-instrument scenario remains under-explored. The challenges of dance-driven multi-instrument music (MIDI) generation are twofold: (i) the lack of a publicly available paired multi-instrument MIDI and video dataset, and (ii) the weak correlation between music and video. To tackle these challenges, we have built the first multi-instrument MIDI and dance paired dataset (D2MIDI). Based on this dataset, we introduce a multi-instrument MIDI generation framework (Dance2MIDI) conditioned on dance video. Firstly, to capture the relationship between dance and music, we employ a graph convolutional network to encode the dance motion, which allows us to extract features related to dance movement and dance style. Secondly, to generate a harmonious rhythm, we use a transformer model with a cross-attention mechanism to decode the drum track sequence. Thirdly, we model the generation of the remaining tracks, conditioned on the drum track, as a sequence understanding and completion task: a BERT-like model learns to comprehend the context of the entire music piece through self-supervised learning. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method achieves state-of-the-art performance.
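To make the drum-track step concrete, here is a hedged PyTorch sketch of a transformer decoder that cross-attends to encoded dance-motion features while autoregressively predicting drum MIDI tokens; the vocabulary size, dimensions, and class name are illustrative and not taken from Dance2MIDI.

```python
import torch
import torch.nn as nn

class DrumTrackDecoder(nn.Module):
    """Autoregressive transformer decoder: drum MIDI tokens self-attend under a
    causal mask and cross-attend to encoded dance-motion features."""
    def __init__(self, vocab_size=512, dim=256, num_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, drum_tokens, motion_feats):
        # drum_tokens:  (B, L) previously generated drum MIDI tokens
        # motion_feats: (B, T, dim) dance-motion features, e.g. from a GCN encoder
        x = self.tok_emb(drum_tokens)
        L = x.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        x = self.decoder(tgt=x, memory=motion_feats, tgt_mask=causal)
        return self.head(x)                           # next-token logits

logits = DrumTrackDecoder()(torch.randint(0, 512, (2, 32)), torch.randn(2, 60, 256))
```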
Funding: Supported by the National Key Research and Development Program of China (No. 2022YFF0604504), the National Natural Science Foundation of China (No. 62172439), the Major Project of the Natural Science Foundation of Hunan Province (No. 2021JC0004), the National Natural Science Fund for Excellent Young Scholars of Hunan Province (No. 2023JJ20076), and the Central South University Innovation-Driven Research Programme (No. 2023CXQD061).
Abstract: We introduce the Open Sequential Repetitive Action Counting (OSRAC) task, which aims to count all repetitions and locate the transition boundaries of sequential actions in large-scale video data without relying on predefined action categories. Unlike the Repetitive Action Counting (RAC) task, which rests on a single-action assumption, OSRAC handles the diverse and alternating repetitive action sequences found in real-world scenarios, which is fundamentally more challenging. To this end, we propose UniCount, a universal system capable of counting multiple sequential repetitive actions from video data. Specifically, UniCount comprises three primary modules: the Universal Repetitive Pattern Learner (URPL), which captures general repetitive patterns in alternating actions; the Temporal Action Boundary Discriminator (TABD), which locates action transition boundaries; and the Dual Density Map Estimator (DDME), which performs action counting and repetition segmentation. We also design a novel actionness loss to improve the detection of action transitions. To support this task, we conduct an in-depth analysis of existing RAC datasets and construct several OSRAC benchmarks (i.e., MUCFRep, MRepCount, and MInfiniteRep) by developing a data processing and mining pipeline. We further perform comprehensive experiments to evaluate the effectiveness of UniCount. On MInfiniteRep, UniCount substantially improves the Off-By-One Accuracy (OBOA) from 0.39 to 0.78 and decreases the Mean Absolute Error (MAE) from 0.29 to 0.14 compared with its counterparts. UniCount also achieves superior performance on open-set data, showcasing its universality.
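For reference, the two reported metrics can be computed as in the sketch below, which follows the definitions commonly adopted in repetition-counting work (a prediction is OBO-correct when it is within one repetition of the ground truth; MAE is the absolute error normalized by the ground-truth count); the paper's exact protocol may differ in detail.

```python
import torch

def counting_metrics(pred_counts, gt_counts):
    """Off-By-One Accuracy (OBOA) and normalized Mean Absolute Error (MAE):
    a prediction counts as OBO-correct if it lies within one repetition of
    the ground truth; MAE divides the absolute error by the ground-truth count."""
    pred = torch.as_tensor(pred_counts, dtype=torch.float)
    gt = torch.as_tensor(gt_counts, dtype=torch.float)
    oboa = (torch.abs(pred - gt) <= 1).float().mean().item()
    mae = (torch.abs(pred - gt) / gt.clamp(min=1)).mean().item()
    return oboa, mae

oboa, mae = counting_metrics([5, 9, 3], [5, 10, 6])   # toy predictions vs. labels
```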
Funding: National Natural Science Foundation of China (Nos. 61702323 and 62172268); the Shanghai Municipal Natural Science Foundation, China (No. 20ZR1423100); the Open Fund of the Science and Technology on Thermal Energy and Power Laboratory (No. TPL2020C02), Wuhan 2nd Ship Design and Research Institute, Wuhan, China; the National Key Research and Development Program of China (No. 2018YFB1306303); and the Major Basic Research Projects of the Natural Science Foundation of Shandong Province, China (No. ZR2019ZD07).
Abstract: Moving object segmentation (MOS), which aims to segment moving objects from video frames, is an important and challenging task in computer vision with a wide range of applications. With the development of deep learning (DL), MOS has entered the era of deep models geared toward spatiotemporal feature learning. This paper provides an up-to-date review of DL-based MOS methods proposed during the past three years. Specifically, we present a categorization based on model characteristics and then compare and discuss each category from the perspectives of feature learning (FL) and of model training and evaluation. For FL, the reviewed methods are divided into three types: spatial FL, temporal FL, and spatiotemporal FL; these are analyzed from the input and model-architecture aspects, and three input types and four typical preprocessing subnetworks are summarized. In terms of training, we discuss ideas for enhancing model transferability. In terms of evaluation, building on a previous categorization into scene-dependent and scene-independent evaluation, and taking into account whether the videos are recorded with static or moving cameras, we further define four subdivided evaluation setups and analyze the setups used by the reviewed methods. We also show performance comparisons of selected reviewed MOS methods and analyze the technical advantages and disadvantages of the reviewed methods. Finally, based on these comparisons and discussions, we present research prospects and future directions.