Current spatio-temporal action detection methods lack sufficient capabilities in extracting and comprehending spatio-temporal information. This paper introduces an end-to-end Adaptive Cross-Scale Fusion Encoder-Decoder (ACSF-ED) network to predict the action and locate the object efficiently. In the Adaptive Cross-Scale Fusion Spatio-Temporal Encoder (ACSF ST-Encoder), the Asymptotic Cross-scale Feature-fusion Module (ACCFM) is designed to address the information degradation caused by the propagation of high-level semantic information, thereby extracting high-quality multi-scale features for subsequent spatio-temporal modeling. Within the Shared-Head Decoder structure, a shared classification and regression detection head is constructed. A multi-constraint loss function composed of one-to-one, one-to-many, and contrastive denoising losses is designed to address the insufficient constraints that traditional methods place on predicted results. This loss function improves the accuracy of classification predictions and brings regression position predictions closer to the ground-truth objects. The proposed model is evaluated on the popular UCF101-24 and JHMDB-21 datasets. Experimental results demonstrate that it achieves 81.52% on the Frame-mAP metric, surpassing existing methods.
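The multi-constraint objective described above combines its three terms additively; a minimal sketch, assuming a simple weighted sum with illustrative (not the paper's) weights and loss values:

```python
import numpy as np

def multi_constraint_loss(l_o2o, l_o2m, l_dn, w=(1.0, 1.0, 1.0)):
    """Combine one-to-one, one-to-many, and contrastive denoising losses
    as a weighted sum; equal weights are an illustrative assumption."""
    return w[0] * l_o2o + w[1] * l_o2m + w[2] * l_dn

# Illustrative per-batch loss values.
total = multi_constraint_loss(0.5, 0.3, 0.2)
weighted = multi_constraint_loss(1.0, 2.0, 3.0, w=(0.5, 0.25, 0.25))
```

In practice each term would be produced by its own matching branch of the decoder; only the additive combination is shown here.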
As Deepfake technology continues to evolve, the distinction between real and fake content becomes increasingly blurred. Most existing Deepfake video detection methods rely on single-frame facial image features, which limits their ability to capture temporal differences between frames. Current methods also exhibit limited generalization, struggling to detect content generated by unknown forgery algorithms. Moreover, the diversity and complexity of forgery techniques introduced by Artificial Intelligence Generated Content (AIGC) present significant challenges for traditional detection frameworks, which must balance high detection accuracy with robust performance. To address these challenges, we propose a novel Deepfake detection framework that combines a two-stream convolutional network with a Vision Transformer (ViT) module to enhance spatio-temporal feature representation. The ViT model extracts spatial features from the forged video, while the 3D convolutional network captures temporal features. The 3D convolution enables cross-frame feature extraction, allowing the model to detect subtle facial changes between frames. The confidence scores from the ViT and 3D convolution submodels are fused at the decision layer, enabling the model to handle unknown forgery techniques effectively. Focusing on Deepfake videos and GAN-generated images, the proposed approach is evaluated on two widely used public face forgery datasets. Compared to existing state-of-the-art methods, it achieves higher detection accuracy and better generalization, offering a robust solution for Deepfake detection in real-world scenarios.
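The decision-layer fusion step can be sketched as a weighted average of the per-class confidence scores from the two streams; the weight and the score values below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def fuse_scores(vit_scores, c3d_scores, w_vit=0.5):
    """Fuse per-class confidence scores from a ViT (spatial) stream and a
    3D-CNN (temporal) stream at the decision layer via a weighted average."""
    vit_scores = np.asarray(vit_scores, dtype=float)
    c3d_scores = np.asarray(c3d_scores, dtype=float)
    return w_vit * vit_scores + (1.0 - w_vit) * c3d_scores

# Illustrative scores for the classes [real, fake].
fused = fuse_scores([0.3, 0.7], [0.1, 0.9], w_vit=0.5)
pred = int(np.argmax(fused))  # index 1 -> "fake"
```

Fusing at the score level (rather than the feature level) lets either stream veto the other, which is what allows the combined model to flag forgeries that only one stream detects.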
Health monitoring of electro-mechanical actuators (EMAs) is critical to ensuring the safety of airplanes. It is difficult or even impossible to collect enough labeled failure or degradation data from actual EMAs. The autoencoder based on reconstruction loss is a popular model that can carry out anomaly detection using only normal training data, but it fails to capture the spatio-temporal information in multivariate time series signals from multiple monitoring sensors. To mine this spatio-temporal information, this paper proposes an attention graph stacked autoencoder for EMA anomaly detection. First, attention graph convolution is introduced into the autoencoder to convolve temporal information from neighbor features into current features with different attention weights. Second, a stacked autoencoder is applied to mine spatial information from those newly aggregated temporal features. Finally, based on the benchmark reconstruction loss of normal training data, health thresholds calculated from several statistical indicators are used to detect anomalies in new test data. Compared with a traditional stacked autoencoder, the proposed model achieves a higher fault detection rate and a lower false alarm rate in the EMA anomaly detection experiment.
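The final thresholding step — flagging a test sample whose reconstruction error exceeds a statistic computed on normal training data — can be sketched independently of the autoencoder itself; the synthetic error distribution and the mean + 3·std indicator below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the autoencoder's per-sample reconstruction errors
# on normal training data.
train_err = np.abs(rng.normal(loc=0.1, scale=0.02, size=1000))

# Health threshold from a simple statistical indicator: mean + 3 * std.
threshold = train_err.mean() + 3.0 * train_err.std()

def is_anomaly(err, thr=threshold):
    """A test sample is anomalous if its reconstruction error exceeds
    the threshold learned from normal data."""
    return err > thr

normal_flag = is_anomaly(0.1)  # typical normal-range error
fault_flag = is_anomaly(0.5)   # much larger error -> anomaly
```

Other indicators (e.g., a high percentile of the training errors) would slot into the same structure; only the threshold rule changes.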
Most existing action recognition methods mainly utilize spatio-temporal descriptors of single interest points while ignoring their potential integral information, such as spatial distribution information. By combining local spatio-temporal features with the global positional distribution information (PDI) of interest points, a novel motion descriptor is proposed in this paper. The proposed method detects interest points using an improved interest point detection method. Then, 3-dimensional scale-invariant feature transform (3D SIFT) descriptors are extracted for every interest point. To obtain a compact description and efficient computation, principal component analysis (PCA) is applied twice to the 3D SIFT descriptors, over single frames and over multiple frames. Simultaneously, the PDI of the interest points is computed and combined with the above features. The combined features are quantized, selected, and finally evaluated with the support vector machine (SVM) recognition algorithm on the public KTH dataset. The results show that the recognition rate is significantly improved and that the proposed features describe human motion more accurately, with high adaptability across scenarios.
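The PCA compaction step can be sketched with a plain SVD; the descriptor matrix below is a random stand-in for 3D SIFT vectors, and the dimensions are illustrative:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components.
    The right singular vectors of the centered data are the principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(1)
# 200 interest points with 640-D descriptors (stand-in for 3D SIFT).
descriptors = rng.normal(size=(200, 640))
compact = pca_reduce(descriptors, k=64)
```

In the paper's pipeline this reduction would be run twice — once on single-frame descriptors and once across multiple frames — before concatenation with the PDI features.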
Action recognition and detection is an important research topic in computer vision that can be divided into action recognition and action detection. At present, the distinction between the two is not clear, and the relevant reviews are not comprehensive. This paper therefore surveys deep-learning-based action recognition and detection methods and datasets to accurately present the research status of this field. First, according to how temporal and spatial features are extracted, the commonly used action recognition models are divided by architecture into two-stream models, temporal models, spatio-temporal models, and transformer models. The paper briefly analyzes the characteristics of these four model families and reports the accuracy of various algorithms on common datasets. Then, from the perspective of the task to be completed, action detection is further divided into temporal action detection and spatio-temporal action detection, and commonly used datasets are introduced. Various temporal action detection algorithms are reviewed from the perspectives of two-stage and one-stage methods, and spatio-temporal action detection algorithms are summarized in detail. Finally, the relationships between the different parts of action recognition and detection are discussed, the difficulties faced by current research are summarized in detail, and future development directions are outlined.
Laboratory safety is a critical area of broad societal concern, particularly the detection of abnormal actions. To enhance the efficiency and accuracy of detecting such actions, this paper introduces a novel method called TubeRAPT (Tubelet Transformer based on Adapter and Prefix Training Module). This method primarily comprises three key components: the TubeR network, an adaptive clustering attention mechanism, and a prefix training module. These components work in synergy to address the challenge of preserving knowledge from models pretrained on large datasets while maintaining training efficiency. The TubeR network serves as the backbone for spatio-temporal feature extraction, the adaptive clustering attention mechanism refines the focus on relevant information, and the prefix training module facilitates efficient fine-tuning and knowledge transfer. Experimental results demonstrate the effectiveness of TubeRAPT, which achieves 68.44% mean Average Precision (mAP) on the small-scale CLA (Crazy Lab Activity) dataset, a significant improvement of 1.53% over the previous TubeR method. This research showcases the potential applications of TubeRAPT in abnormal action detection and offers innovative ideas and technical support for the future development of laboratory safety monitoring technologies. The proposed method has implications for improving safety management in various laboratory environments, potentially reducing accidents and enhancing overall workplace safety.
Most intelligent surveillance systems in industry care only about the safety of workers. It would be meaningful if the camera could know what, where, and how a worker has performed an action in real time. In this paper, we propose a lightweight and robust algorithm to meet these requirements. Using only the trajectories of the two hands, our algorithm requires no Graphics Processing Unit (GPU) acceleration and can run on low-cost devices. In the training stage, to find potential topological structures in the training trajectories, spectral clustering with the eigengap heuristic is applied to cluster trajectory points. A gradient-descent-based algorithm is proposed to find the topological structures, which reflect the main representations of each cluster. In the fine-tuning stage, a topological optimization algorithm is proposed to fine-tune the parameters of the topological structures over all training data. Finally, our method not only performs more robustly than some popular offline action detection methods but also obtains better detection accuracy on an extended action sequence.
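The eigengap heuristic — choose the number of clusters k at the largest gap in the sorted spectrum of the normalized graph Laplacian — can be sketched as follows; the similarity matrix is a toy two-block example, not the paper's trajectory data:

```python
import numpy as np

def eigengap_k(W, max_k=10):
    """Pick the number of clusters from the spectrum of the normalized
    Laplacian L = I - D^{-1/2} W D^{-1/2}: k is where the gap between
    consecutive sorted eigenvalues is largest."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals = np.sort(np.linalg.eigvalsh(L))[:max_k]
    gaps = np.diff(vals)
    return int(np.argmax(gaps)) + 1

# Toy similarity matrix: two well-separated groups of three points.
block = np.ones((3, 3))
W = np.block([[block, 0.01 * np.ones((3, 3))],
              [0.01 * np.ones((3, 3)), block]])
k = eigengap_k(W, max_k=5)
```

For nearly disconnected components the first k eigenvalues sit near zero and the (k+1)-th jumps up, so the largest gap recovers the number of groups — here k = 2.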
Understanding the dynamics of urbanization is essential to the sustainable development of cities, and the analysis of urban development can provide scientific and effective information for decision-making. Using long-term Defense Meteorological Satellite Program Operational Linescan System (DMSP/OLS) nighttime light images, this study conducted a pixel-level assessment of urbanization in China from 1992 to 2013 and detected the spatio-temporal dynamics and future trends of urban development. The results show that the urbanization and urban dynamics of China experienced drastic fluctuations from 1992 to 2013, especially in coastal and metropolitan areas. From a regional perspective, the urban dynamics and increasing trends in North Coast, East Coast, and South Coast China were much more stable and significant than in other regions. Moreover, by estimating the sustainability of nighttime light dynamics, regional agglomeration trends of urban regions were also detected. The light intensity in nearly 50% of lighted pixels may continuously decrease in the future, indicating a severe situation for urbanization within these regions. The results of this study provide new insight into long-term urbanization detection and thus contribute to a better understanding of the trends and dynamics of urban development.
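The per-pixel trend detection over an annual image stack can be sketched as fitting a least-squares slope to each pixel's light-intensity time series and flagging pixels with a negative slope; the data below are synthetic:

```python
import numpy as np

def pixel_trends(stack):
    """stack: (T, H, W) array of nighttime light intensities.
    Returns the per-pixel least-squares slope over time."""
    T = stack.shape[0]
    t = np.arange(T, dtype=float)
    t_centered = t - t.mean()
    X = stack.reshape(T, -1)
    slopes = (t_centered @ (X - X.mean(axis=0))) / (t_centered @ t_centered)
    return slopes.reshape(stack.shape[1:])

T, H, W = 22, 4, 4  # 22 annual composites, 1992-2013
t = np.arange(T, dtype=float)
stack = np.zeros((T, H, W))
stack[:, :2, :] = (2.0 * t)[:, None, None]   # brightening half of the scene
stack[:, 2:, :] = (50.0 - t)[:, None, None]  # dimming half of the scene

slopes = pixel_trends(stack)
decreasing_share = float((slopes < 0).mean())  # fraction of dimming pixels
```

On real DMSP/OLS data this would need the usual inter-annual calibration of the composites first; only the trend-extraction step is shown.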
Automatic detection of student engagement levels from videos, which is a spatio-temporal classification problem, is crucial for enhancing the quality of online education. This paper addresses this challenge by proposing four novel hybrid end-to-end deep learning models for the automatic detection of student engagement levels in e-learning videos. The models are evaluated on the DAiSEE dataset, a public repository capturing student affective states in e-learning scenarios. The first model integrates EfficientNetV2-L with a Gated Recurrent Unit (GRU) and attains an accuracy of 61.45%. The second model combines EfficientNetV2-L with a bidirectional GRU (Bi-GRU), yielding an accuracy of 61.56%. The third and fourth models fuse EfficientNetV2-L with Long Short-Term Memory (LSTM) and bidirectional LSTM (Bi-LSTM), achieving accuracies of 62.11% and 61.67%, respectively. Our findings demonstrate the viability of these models in discerning student engagement levels, with the EfficientNetV2-L + LSTM model emerging as the most proficient at 62.11% accuracy. This study underscores the potential of hybrid spatio-temporal networks for automating the detection of student engagement, thereby contributing to advancements in online education quality.
End-to-end Temporal Action Detection (TAD) has achieved remarkable progress in recent years, driven by innovations in model architectures and the emergence of Video Foundation Models (VFMs). However, existing TAD methods that perform full fine-tuning of pretrained video models often incur substantial computational costs, which become particularly pronounced when processing long video sequences. Moreover, the need for precise temporal boundary annotations makes data labeling extremely expensive. In low-resource settings where annotated samples are scarce, direct fine-tuning tends to cause overfitting. To address these challenges, we introduce the Dynamic Low-Rank Adapter (DyLoRA), a lightweight fine-tuning framework tailored specifically for the TAD task. Built upon the Low-Rank Adaptation (LoRA) architecture, DyLoRA adapts only the key layers of the pretrained model via low-rank decomposition, reducing the number of trainable parameters to less than 5% of full fine-tuning. This significantly lowers memory consumption and mitigates overfitting in low-resource settings. Notably, DyLoRA enhances the temporal modeling capability of pretrained models by optimizing temporal-dimension weights, thereby alleviating the representation misalignment of temporal features. Experimental results demonstrate that DyLoRA-TAD achieves impressive performance, with 73.9% mAP on THUMOS14, 39.52% on ActivityNet-1.3, and 28.2% on Charades, substantially surpassing the best traditional feature-based methods.
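The low-rank update at the core of LoRA — freeze the pretrained weight W and learn only a rank-r correction B·A — and the resulting trainable-parameter fraction can be sketched as follows (layer size, rank, and scaling are illustrative, not DyLoRA's configuration):

```python
import numpy as np

d_out, d_in, r = 768, 768, 8           # illustrative layer size and LoRA rank
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-init
alpha = 16.0                           # scaling factor

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B are trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))

trainable = A.size + B.size            # r * (d_in + d_out)
full = W.size                          # d_out * d_in
fraction = trainable / full            # ~2% here, well under 5%
```

Because B starts at zero, the adapted layer initially reproduces the pretrained output exactly, and the correction grows only as A and B are trained.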
Vision-language models (VLMs) have shown strong open-vocabulary learning abilities in various video understanding tasks. However, when applied to open-vocabulary temporal action detection (OV-TAD), existing methods often struggle to generalize to unseen action categories because of their reliance on visual features. In this paper, we propose a novel framework, Concept-Guided Semantic Projection (CSP), to enhance the generalization ability of OV-TAD methods. By projecting video features into a unified action concept space, CSP enables the use of abstracted action concepts for action detection rather than relying solely on visual details. To further improve feature consistency across action categories, we introduce a mutual contrastive loss (MCL), ensuring semantic coherence and better feature discrimination. Extensive experiments on the ActivityNet and THUMOS14 benchmarks demonstrate that our method outperforms state-of-the-art OV-TAD methods. Code and data are available at Concept-Guided-OV-TAD.
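Projecting a video feature into a shared concept space and matching it to concept embeddings by cosine similarity can be sketched as follows; the projection matrix and concept vectors are random stand-ins, not CSP's learned parameters:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_visual, d_concept, n_classes = 512, 256, 4

P = rng.normal(size=(d_concept, d_visual))                        # projection (stand-in)
concepts = l2_normalize(rng.normal(size=(n_classes, d_concept)))  # action concepts

def classify(video_feat):
    """Project a video feature into the concept space and pick the most
    similar action concept by cosine similarity."""
    z = l2_normalize(P @ video_feat)
    sims = concepts @ z
    return int(np.argmax(sims)), sims

# A visual feature constructed to align with concept 2 (via the pseudo-inverse).
target = np.linalg.pinv(P) @ concepts[2]
pred, sims = classify(target)
```

Matching in concept space rather than raw visual space is what lets unseen categories be handled: any new class only needs a concept embedding, not visual training examples.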
Funding: This work was supported by the Key Lab of Intelligent and Green Flexographic Printing under Grant ZBKT202301.
Funding: Supported by the National Natural Science Foundation of China (Nos. 62477026, 62177029, and 61807020), the Humanities and Social Sciences Research Program of the Ministry of Education of China (No. 23YJAZH047), and the Startup Foundation for Introducing Talent of Nanjing University of Posts and Communications under Grant NY222034.
Funding: Supported by the National Natural Science Foundation of China (Nos. 52075349 and 62303335), the Postdoctoral Researcher Program of China (No. GZC20231779), and the Natural Science Foundation of Sichuan Province (No. 2022NSFSC1942).
Funding: Supported by the National Natural Science Foundation of China (No. 61103123) and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
Funding: Supported by the National Educational Science 13th Five-Year Plan Project (JYKYB2019012), the Basic Research Fund for the Engineering University of PAP (WJY201907), and the Basic Research Fund of the Engineering University of PAP (WJY202120).
Funding: Supported by the Philosophy and Social Sciences Planning Project of Guangdong Province of China (GD23XGL099), the Guangdong General Universities Young Innovative Talents Project (2023KQNCX247), and the Research Project of Shanwei Institute of Technology (SWKT22-019).
Funding: This research was supported in part by the National Natural Science Foundation of China under Grants 61673261 and 61703273. We gratefully acknowledge the support of several companies.
Funding: Under the auspices of the State Scholarship Fund of the China Scholarship Council (No. 201706320300).
Funding: Supported by the International Collaborative Research Program of the Shanghai Science and Technology Committee (No. 12510708400) and the Summit Filmology Program of Shanghai University in 2015 (No. n.13-a303-15-w23).
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62266054), the Major Science and Technology Project of Yunnan Province (Grant No. 202402AD080002), and the Scientific Research Fund of the Yunnan Provincial Department of Education (Grant No. 2025Y0302).
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62402490 and the Guangdong Basic and Applied Basic Research Foundation of China under Grant No. 2025A1515010101.