Video reconstruction quality in Distributed Compressed Video Sensing (DCVS) largely depends on the ability of the employed sparse domain to adequately represent the underlying video. In this paper, we propose a novel dynamic global-Principal Component Analysis (PCA) sparse representation algorithm for video based on the sparse-land model and nonlocal similarity. First, grouping by matching is performed at the decoder on previously recovered key frames. Second, PCA is applied to each group (sub-dataset) to compute the principal components from which the sub-dictionary is constructed. Finally, the non-key frames are reconstructed from random measurement data using a Compressed Sensing (CS) reconstruction algorithm with sparse regularization. Experimental results show that our algorithm outperforms reconstruction with the DCT and K-SVD dictionaries.
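The per-group dictionary step can be made concrete with a short sketch. The code below is an illustrative reconstruction of the idea rather than the authors' implementation: patches are extracted from a recovered key frame, a group of similar patches is formed by nearest-neighbour matching, and the leading principal components of that group serve as the sub-dictionary. The function names, patch size, group size k, and number of atoms are all assumptions introduced for illustration.

```python
import numpy as np

def extract_patches(frame, patch=8, stride=4):
    """Slide a patch x patch window over a 2-D frame; return patches as columns."""
    H, W = frame.shape
    cols = []
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            cols.append(frame[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(cols, axis=1)              # shape: (patch*patch, n_patches)

def group_by_matching(patches, ref, k=40):
    """Form a group (sub-dataset): the k patches closest to the reference patch."""
    dist = np.sum((patches - ref[:, None]) ** 2, axis=0)
    return patches[:, np.argsort(dist)[:k]]

def pca_sub_dictionary(group, n_atoms=16):
    """PCA of the group: the leading principal components become the sub-dictionary."""
    mean = group.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(group - mean, full_matrices=False)
    return U[:, :n_atoms], mean                # orthonormal atoms + group mean

# Example: build one sub-dictionary from a recovered key frame (random stand-in here).
key_frame = np.random.rand(64, 64)
patches = extract_patches(key_frame)
dictionary, group_mean = pca_sub_dictionary(group_by_matching(patches, patches[:, 0]))
```

Because each PCA sub-dictionary is orthonormal, sparse coding a patch under it amounts to projecting onto the atoms and thresholding the coefficients, which is one reason per-group PCA dictionaries fit naturally into an iterative CS reconstruction loop.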
Video-text retrieval (VTR) is an essential task in multimodal learning that aims to bridge the semantic gap between visual and textual data. Effective video frame sampling plays a crucial role in retrieval performance, since it determines the quality of the visual content representation. Traditional sampling methods, such as uniform sampling and optical-flow-based techniques, often fail to capture the full semantic range of a video, leading to redundancy and inefficiency. In this work, we propose CLIP4Video-Sampling, a global semantics-guided multi-granularity frame sampling strategy for video-text retrieval designed to optimize both computational efficiency and retrieval accuracy. By integrating multi-scale global and local temporal sampling and leveraging the feature extraction capabilities of the CLIP (Contrastive Language-Image Pre-training) model, our method significantly outperforms existing approaches on both zero-shot and fine-tuned video-text retrieval tasks on popular datasets. CLIP4Video-Sampling reduces redundancy, ensures keyframe coverage, and serves as an adaptable pre-processing module for multimodal models.
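To make the sampling idea concrete, the sketch below shows one way a global semantics-guided, multi-granularity selection could be implemented on top of precomputed CLIP frame embeddings. It is an assumption-laden illustration, not the exact CLIP4Video-Sampling procedure: the greedy diversity criterion, window size, and the function name sample_frames are hypothetical choices.

```python
import numpy as np

def sample_frames(frame_embeddings, n_global=8, window=16):
    """Greedy, semantics-guided frame sampling over CLIP image embeddings.

    frame_embeddings: (T, D) array, one embedding per candidate frame.
    Returns the sorted indices of the selected frames.
    """
    T = frame_embeddings.shape[0]
    emb = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)

    # Global pass: start from the frame closest to the video-level mean embedding,
    # then greedily add frames that are least similar to anything already selected.
    selected = [int(np.argmax(emb @ emb.mean(axis=0)))]
    while len(selected) < min(n_global, T):
        redundancy = (emb @ emb[selected].T).max(axis=1)  # cosine sim to chosen set
        redundancy[selected] = np.inf                     # never re-pick a frame
        selected.append(int(np.argmin(redundancy)))

    # Local pass: in every temporal window keep the frame closest to that window's
    # mean embedding, so short-term (local) content is still covered.
    for start in range(0, T, window):
        w = emb[start:start + window]
        best = start + int(np.argmax(w @ w.mean(axis=0)))
        if best not in selected:
            selected.append(best)

    return sorted(selected)

# Example with random vectors standing in for CLIP embeddings of 120 candidate frames.
indices = sample_frames(np.random.randn(120, 512))
```

In this sketch the global pass favours frames that add new video-level semantic content, while the local pass guarantees that every temporal window contributes at least one representative frame, mirroring the global/local multi-granularity described above.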
Funding: supported by the Innovation Project of Graduate Students of Jiangsu Province, China under Grants No. CXZZ12_0466 and No. CXZZ11_0390; the National Natural Science Foundation of China under Grants No. 61071091, No. 61271240, No. 61201160 and No. 61172118; the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province, China under Grant No. 12KJB510019; the Science and Technology Research Program of Hubei Provincial Department of Education under Grants No. D20121408 and No. D20121402; and the Program for Research Innovation of Nanjing Institute of Technology under Grant No. CKJ20110006.