Funding: Supported by the National Natural Science Foundation of China (Grant No. 62266054), the Major Science and Technology Project of Yunnan Province (Grant No. 202402AD080002), and the Scientific Research Fund of the Yunnan Provincial Department of Education (Grant No. 2025Y0302).
Abstract: End-to-end Temporal Action Detection (TAD) has achieved remarkable progress in recent years, driven by innovations in model architectures and the emergence of Video Foundation Models (VFMs). However, existing TAD methods that fully fine-tune pretrained video models often incur substantial computational costs, which become particularly pronounced when processing long video sequences. Moreover, the need for precise temporal boundary annotations makes data labeling extremely expensive, and in low-resource settings where annotated samples are scarce, direct fine-tuning tends to cause overfitting. To address these challenges, we introduce the Dynamic Low-Rank Adapter (DyLoRA), a lightweight fine-tuning framework tailored specifically to the TAD task. Built upon the Low-Rank Adaptation (LoRA) architecture, DyLoRA adapts only the key layers of the pretrained model via low-rank decomposition, reducing the number of trainable parameters to less than 5% of that required by full fine-tuning. This significantly lowers memory consumption and mitigates overfitting in low-resource settings. Notably, DyLoRA enhances the temporal modeling capability of pretrained models by optimizing weights along the temporal dimension, thereby alleviating the representation misalignment of temporal features. Experimental results demonstrate that DyLoRA-TAD achieves strong performance, with 73.9% mAP on THUMOS14, 39.52% on ActivityNet-1.3, and 28.2% on Charades, substantially surpassing the best traditional feature-based methods.
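To make the parameter-efficiency argument concrete, below is a minimal PyTorch sketch of the general LoRA mechanism the abstract builds on: a frozen pretrained linear layer augmented with a trainable rank-r residual, plus an illustrative per-frame temporal gate. The abstract does not specify DyLoRA's actual dynamic design, so `TemporalGate` and all hyperparameters here are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A x). Only A and B receive gradients."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # frozen base path + trainable low-rank residual path
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

class TemporalGate(nn.Module):
    """Hypothetical per-frame reweighting over the temporal axis of
    (batch, time, channels) features; the paper's exact temporal-weight
    optimization is not described in the abstract."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(),
            nn.Linear(dim // 4, 1), nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (B, T, C)
        return x * self.gate(x)              # (B, T, 1) gate broadcast over C

# Trainable-parameter budget check: the adapter stays well under the
# 5% fraction the abstract cites.
base = nn.Linear(768, 768)
lora = LoRALinear(base, r=8)
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora.parameters())
print(f"trainable fraction: {trainable / total:.3%}")   # ~2% at r=8
```

At rank r = 8 on a 768-dimensional layer, the adapter trains 2 x 8 x 768 = 12,288 parameters against roughly 590K frozen ones, which is how a per-layer low-rank update keeps the overall trainable budget small.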
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62402490 and the Guangdong Basic and Applied Basic Research Foundation of China under Grant No. 2025A1515010101.
Abstract: Vision-language models (VLMs) have shown strong open-vocabulary learning abilities in various video understanding tasks. However, when applied to open-vocabulary temporal action detection (OV-TAD), existing methods often struggle to generalize to unseen action categories because of their reliance on visual features. In this paper, we propose a novel framework, Concept-Guided Semantic Projection (CSP), to enhance the generalization ability of OV-TAD methods. By projecting video features into a unified action concept space, CSP enables action detection to draw on abstracted action concepts rather than relying solely on visual details. To further improve feature consistency across action categories, we introduce a mutual contrastive loss (MCL), ensuring semantic coherence and better feature discrimination. Extensive experiments on the ActivityNet and THUMOS14 benchmarks demonstrate that our method outperforms state-of-the-art OV-TAD methods. Code and data are available at Concept-Guided-OV-TAD.
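As a rough illustration of the projection-plus-contrastive pattern the abstract describes, the sketch below maps video features into a normalized concept space and applies a symmetric InfoNCE loss in both the video-to-concept and concept-to-video directions. The paper's actual CSP and MCL formulations are not given in the abstract, so the module names, dimensions, and loss form here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptProjection(nn.Module):
    """Maps video clip features into a shared concept space where they can
    be scored against text embeddings of action-concept prompts."""
    def __init__(self, vid_dim: int, concept_dim: int):
        super().__init__()
        self.proj = nn.Linear(vid_dim, concept_dim)

    def forward(self, vid_feats):                 # (N, vid_dim)
        return F.normalize(self.proj(vid_feats), dim=-1)

def mutual_contrastive_loss(video_z, text_z, tau: float = 0.07):
    """Symmetric InfoNCE over matched (video, concept) pairs: each video
    must retrieve its concept embedding and vice versa."""
    logits = video_z @ text_z.T / tau             # (N, N) cosine similarities
    targets = torch.arange(video_z.size(0), device=video_z.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage: 4 clips scored against CLIP-like 512-d concept embeddings.
proj = ConceptProjection(vid_dim=768, concept_dim=512)
video_z = proj(torch.randn(4, 768))
text_z = F.normalize(torch.randn(4, 512), dim=-1)
print(mutual_contrastive_loss(video_z, text_z).item())
```

Because both directions of the loss share one similarity matrix, the video and concept embeddings are pulled into mutual alignment, which is one plausible reading of how a mutual contrastive loss would enforce the semantic coherence the abstract claims.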