Journal Articles

The following journal articles were found:

458 articles found in total.
FCDFusion: A fast, low color deviation method for fusing visible and infrared image pairs
Authors: Hesong Li, Ying Fu. Computational Visual Media, 2025, Issue 1, pp. 195-211 (17 pages).
Visible and infrared image fusion (VIF) aims to combine information from visible and infrared images into a single fused image. Previous VIF methods usually employ a color space transformation to keep the hue and saturation from the original visible image. However, for fast VIF methods, this operation accounts for the majority of the calculation and is the bottleneck preventing faster processing. In this paper, we propose a fast fusion method, FCDFusion, with little color deviation. It preserves color information without color space transformations, by directly operating in RGB color space. It incorporates gamma correction at little extra cost, allowing color and contrast to be rapidly improved. We regard the fusion process as a scaling operation on 3D color vectors, greatly simplifying the calculations. A theoretical analysis and experiments show that our method can achieve satisfactory results in only 7 FLOPs per pixel. Compared to state-of-the-art fast, color-preserving methods using HSV color space, our method provides higher contrast at only half of the computational cost. We further propose a new metric, color deviation, to measure the ability of a VIF method to preserve color. It is specifically designed for VIF tasks with color visible-light images, and overcomes deficiencies of existing VIF metrics used for this purpose. Our code is available at https://github.com/HeasonLee/FCDFusion.
Keywords: infrared images; visible and infrared image fusion (VIF); gamma correction; real-time display; color metrics; color deviation
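As a rough illustration of the idea (not the paper's exact formula), the sketch below treats fusion as scaling each RGB vector of the visible image by a factor derived from the infrared intensity, followed by gamma correction; the blending rule and luminance definition here are illustrative assumptions.

```python
import numpy as np

def fcd_style_fuse(visible_rgb, infrared, gamma=2.2):
    """Hypothetical sketch of fusion as a per-pixel scaling of RGB vectors.

    visible_rgb: HxWx3 float array in [0, 1]
    infrared:    HxW   float array in [0, 1]
    The scaling rule below is an illustrative assumption, not FCDFusion's
    published formula.
    """
    # Luminance of the visible image (simple channel average; an assumption).
    lum = visible_rgb.mean(axis=2)
    # Target intensity blends visible luminance and infrared response.
    target = 0.5 * lum + 0.5 * infrared
    # Scale each RGB vector so its intensity moves toward the target;
    # one shared factor per pixel keeps hue and saturation unchanged.
    scale = target / np.maximum(lum, 1e-6)
    fused = np.clip(visible_rgb * scale[..., None], 0.0, 1.0)
    # Cheap gamma correction to lift contrast.
    return fused ** (1.0 / gamma)

vis = np.random.rand(4, 4, 3)
ir = np.random.rand(4, 4)
print(fcd_style_fuse(vis, ir).shape)  # (4, 4, 3)
```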
Swin3D: A pretrained transformer backbone for 3D indoor scene understanding
Authors: Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, Baining Guo. Computational Visual Media, 2025, Issue 1, pp. 83-101 (19 pages).
The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validated the scalability, generality, and superior performance enabled by our approach.
Keywords: 3D pretraining; point cloud analysis; transformer backbone; Swin Transformer; 3D semantic segmentation; 3D object detection
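The abstract's key ingredient, window-based self-attention over sparse voxels, can be illustrated with a minimal sketch; the window size, feature dimension, and grouping rule below are illustrative assumptions, not Swin3D's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_window_attention(coords, feats, window=4):
    """Self-attention restricted to voxels that share a window.

    coords: (N, 3) integer voxel coordinates (sparse occupancy).
    feats:  (N, C) per-voxel features.
    Memory stays linear in N because attention is computed only inside
    each window, never across the whole scene.
    """
    out = np.zeros_like(feats)
    buckets = {}
    for i, key in enumerate(map(tuple, coords // window)):  # bucket per window
        buckets.setdefault(key, []).append(i)
    for idx in buckets.values():
        idx = np.array(idx)
        q = k = v = feats[idx]                    # single head, no projections
        attn = softmax(q @ k.T / np.sqrt(q.shape[1]))
        out[idx] = attn @ v
    return out

coords = np.random.randint(0, 16, size=(100, 3))
feats = np.random.randn(100, 8)
print(sparse_window_attention(coords, feats).shape)  # (100, 8)
```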
Mindstorms in natural language-based societies of mind
Authors: Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya A. Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanić, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, Jürgen Schmidhuber. Computational Visual Media, 2025, Issue 1, pp. 29-81 (53 pages).
Inspired by Minsky's Society of Mind, Schmidhuber's Learning to Think, and other more recent works, this paper proposes and advocates for the concept of natural language-based societies of mind (NLSOMs). We imagine these societies as consisting of a collection of multimodal neural networks, including large language models, which engage in a "mindstorm" to solve problems using a shared natural language interface. Here, we work to identify and discuss key questions about the social structure, governance, and economic principles for NLSOMs, emphasizing their impact on the future of AI. Our demonstrations with NLSOMs, which feature up to 129 agents, show their effectiveness in various tasks, including visual question answering, image captioning, and prompt generation for text-to-image synthesis.
Keywords: mindstorm; society of mind (SOM); large language models (LLMs); multimodal learning; learning to think
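A toy sketch of the "mindstorm" idea, with agents modeled as plain Python callables that read and append to a shared natural-language message board; the agent behaviors and the majority-vote aggregation here are illustrative assumptions, not the paper's protocol.

```python
from collections import Counter

def mindstorm(question, agents, rounds=2):
    """Agents exchange natural-language messages on a shared board,
    then a simple majority vote picks the society's answer."""
    board = [f"QUESTION: {question}"]
    for _ in range(rounds):
        for name, agent in agents.items():
            reply = agent(question, board)        # each agent sees the whole board
            board.append(f"{name}: {reply}")
    answers = [msg.split(": ", 1)[1] for msg in board[1:]]
    return Counter(answers).most_common(1)[0][0], board

# Hypothetical stand-in agents; in an NLSOM these would be LLMs or other
# multimodal networks communicating through a natural-language interface.
agents = {
    "agent_a": lambda q, b: "blue",
    "agent_b": lambda q, b: "blue",
    "agent_c": lambda q, b: "green",
}
answer, transcript = mindstorm("What color is the sky?", agents)
print(answer)  # blue
```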
LucIE: Language-guided local image editing for fashion images
Authors: Huanglu Wen, Shaodi You, Ying Fu. Computational Visual Media, 2025, Issue 1, pp. 179-194 (16 pages).
Language-guided fashion image editing is challenging, as fashion image editing is local and requires high precision, while natural language cannot provide precise visual information for guidance. In this paper, we propose LucIE, a novel unsupervised language-guided local image editing method for fashion images. LucIE adopts and modifies a recent text-to-image synthesis network, DF-GAN, as its backbone. However, the synthesis backbone often changes the global structure of the input image, making local image editing impractical. To increase structural consistency between input and edited images, we propose a Content-Preserving Fusion Module (CPFM). Different from existing fusion modules, CPFM prevents iterative refinement on visual feature maps and accumulates additive modifications on RGB maps. LucIE achieves local image editing explicitly with language-guided image segmentation and mask-guided image blending while only using image and text pairs. Results on the DeepFashion dataset show that LucIE achieves state-of-the-art results. Compared with previous methods, images generated by LucIE also exhibit fewer artifacts. We provide visualizations and perform ablation studies to validate LucIE and the CPFM. We also demonstrate and analyze limitations of LucIE, to provide a better understanding of LucIE.
Keywords: deep learning; language-guided image editing; local image editing; content preservation; fashion images
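The two mechanisms named in the abstract, additive modification of RGB maps and mask-guided blending, are simple to write down; the sketch below is illustrative only and does not reproduce LucIE's networks.

```python
import numpy as np

def edit_with_mask(image, rgb_deltas, mask):
    """Accumulate additive RGB modifications, then blend with a mask.

    image:      HxWx3 original image in [0, 1]
    rgb_deltas: list of HxWx3 additive modifications (e.g., one per stage)
    mask:       HxW soft segmentation of the region described by the text
    """
    edited = image.copy()
    for delta in rgb_deltas:          # additive updates on RGB, not on features
        edited = np.clip(edited + delta, 0.0, 1.0)
    mask = mask[..., None]
    # Only the masked region is replaced; the rest keeps the input structure.
    return mask * edited + (1.0 - mask) * image

img = np.random.rand(8, 8, 3)
deltas = [np.random.uniform(-0.1, 0.1, (8, 8, 3)) for _ in range(3)]
mask = (np.random.rand(8, 8) > 0.5).astype(float)
print(edit_with_mask(img, deltas, mask).shape)  # (8, 8, 3)
```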
Script-to-Storyboard: A new contextual retrieval dataset and benchmark
Authors: Xi Tian, Yong-Liang Yang, Qi Wu. Computational Visual Media, 2025, Issue 1, pp. 103-122 (20 pages).
Storyboards comprising key illustrations and images help filmmakers to outline ideas, key moments, and story events when filming movies. Inspired by this, we introduce the first contextual benchmark dataset, Script-to-Storyboard (Sc2St), composed of storyboards to explicitly express story structures in the movie domain, and propose the contextual retrieval task to facilitate movie story understanding. The Sc2St dataset contains fine-grained and diverse texts, annotated semantic keyframes, and coherent storylines in storyboards, unlike existing movie datasets. The contextual retrieval task takes as input a multi-sentence movie script summary with keyframe history and aims to retrieve a future keyframe described by a corresponding sentence to form the storyboard. Compared to classic text-based visual retrieval tasks, this requires capturing the context from the description (script) and keyframe history. We benchmark existing text-based visual retrieval methods on the new dataset and propose a recurrent-based framework with three variants for effective context encoding. Comprehensive experiments demonstrate that our methods compare favourably to existing methods; ablation studies validate the effectiveness of the proposed context encoding approaches.
Keywords: dataset; benchmark; text-based image retrieval; movie
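As a minimal sketch of the contextual retrieval setting, the context from the script sentences and the keyframe history is folded into one query vector that scores candidate keyframes; the simple recurrent update and cosine scoring below are assumptions for illustration, not the paper's exact model.

```python
import numpy as np

def encode_context(sentence_embs, keyframe_embs, W=None):
    """Fold script sentences and keyframe history into one query vector
    with a simple (assumed) recurrent update h = tanh(W [h; x])."""
    dim = sentence_embs.shape[1]
    rng = np.random.default_rng(0)
    W = W if W is not None else rng.standard_normal((dim, 2 * dim)) * 0.1
    h = np.zeros(dim)
    for x in list(sentence_embs) + list(keyframe_embs):
        h = np.tanh(W @ np.concatenate([h, x]))
    return h

def retrieve_next_keyframe(query, candidates):
    """Rank candidate keyframes by cosine similarity to the context query."""
    q = query / (np.linalg.norm(query) + 1e-8)
    c = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(c @ q))

sents = np.random.randn(3, 16)   # embeddings of the script sentences
frames = np.random.randn(2, 16)  # embeddings of keyframes retrieved so far
cands = np.random.randn(10, 16)  # candidate keyframe embeddings
print(retrieve_next_keyframe(encode_context(sents, frames), cands))
```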
Noise-robust few-shot classification via variational adversarial data augmentation
Authors: Renjie Xu, Baodi Liu, Kai Zhang, Honglong Chen, Dapeng Tao, Weifeng Liu. Computational Visual Media, 2025, Issue 1, pp. 227-239 (13 pages).
Few-shot classification models trained with clean samples poorly classify samples from the real world with various scales of noise. To enhance the model for recognizing noisy samples, researchers usually utilize data augmentation or use noisy samples generated by adversarial training for model training. However, existing methods still have problems: (i) the effects of data augmentation on the robustness of the model are limited; (ii) the noise generated by adversarial training usually causes overfitting and reduces the generalization ability of the model, which is very significant for few-shot classification; (iii) most existing methods cannot adaptively generate appropriate noise. Given the above three points, this paper proposes a noise-robust few-shot classification algorithm, VADA (Variational Adversarial Data Augmentation). Unlike existing methods, VADA utilizes a variational noise generator to generate an adaptive noise distribution according to different samples based on adversarial learning, and optimizes the generator by minimizing the expectation of the empirical risk. Applying VADA during training can make few-shot classification more robust against noisy data, while retaining generalization ability. In this paper, we utilize FEAT and ProtoNet as baseline models, and accuracy is verified on several common few-shot classification datasets, including MiniImageNet, TieredImageNet, and CUB. After training with VADA, the classification accuracy of the models increases for samples with various scales of noise.
Keywords: few-shot learning; adversarial learning; robustness; variational method
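A minimal sketch of sample-adaptive variational noise; the linear generator, Gaussian noise form, and reparameterization below are assumptions for illustration, not VADA's architecture or objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_generator(x, W_mu, W_logvar):
    """Predict a per-sample Gaussian noise distribution (mean, log-variance)."""
    return W_mu @ x, W_logvar @ x

def sample_noisy(x, W_mu, W_logvar):
    """Reparameterization trick: noise = mu + sigma * eps, eps ~ N(0, I)."""
    mu, logvar = noise_generator(x, W_mu, W_logvar)
    eps = rng.standard_normal(x.shape)
    return x + mu + np.exp(0.5 * logvar) * eps

dim = 8
x = rng.standard_normal(dim)                    # feature vector of one sample
W_mu = rng.standard_normal((dim, dim)) * 0.01
W_logvar = rng.standard_normal((dim, dim)) * 0.01
x_noisy = sample_noisy(x, W_mu, W_logvar)
print(x_noisy.shape)  # (8,)
# In training, the classifier would be fed x_noisy, and the generator's
# parameters would be updated adversarially against the classification loss.
```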
Diffusion models for 3D generation: A survey
Authors: Chen Wang, Hao-Yang Peng, Ying-Tian Liu, Jiatao Gu, Shi-Min Hu. Computational Visual Media, 2025, Issue 1, pp. 1-28 (28 pages).
Denoising diffusion models have demonstrated tremendous success in modeling data distributions and synthesizing high-quality samples. In the 2D image domain, they have become the state-of-the-art and are capable of generating photo-realistic images with high controllability. More recently, researchers have begun to explore how to utilize diffusion models to generate 3D data, as doing so has more potential in real-world applications. This requires careful design choices in two key ways: identifying a suitable 3D representation and determining how to apply the diffusion process. In this survey, we provide the first comprehensive review of diffusion models for manipulating 3D content, including 3D generation, reconstruction, and 3D-aware image synthesis. We classify existing methods into three major categories: 2D space diffusion with pretrained models, 2D space diffusion without pretrained models, and 3D space diffusion. We also summarize popular datasets used for 3D generation with diffusion models. Along with this survey, we maintain a repository, https://github.com/cwchenwang/awesome-3d-diffusion, to track the latest relevant papers and codebases. Finally, we pose current challenges for diffusion models for 3D generation, and suggest future research directions.
Keywords: diffusion models; 3D generation; generative models; AIG
Exploring contextual priors for real-world image super-resolution
Authors: Shixiang Wu, Chao Dong, Yu Qiao. Computational Visual Media, 2025, Issue 1, pp. 159-177 (19 pages).
Real-world blind image super-resolution is a challenging problem due to the absence of target high-resolution images for training. Inspired by the recent success of the single image generation based method SinGAN, we tackle this challenging problem with a refined model, SR-SinGAN, which can learn to perform single real image super-resolution. Firstly, we empirically find that a downsampled LR input with an appropriate size can improve the robustness of the generation model. Secondly, we introduce a global contextual prior to provide semantic information. This helps to remove distorted pixels and improve the output fidelity. Finally, we design an image gradient based local contextual prior to guide detail generation. It can alleviate generated artifacts in smooth areas while preserving rich details in densely textured regions (e.g., hair, grass). To evaluate the effectiveness of these contextual priors, we conducted extensive experiments on both artificial and real images. Results show that these priors can stabilize training and preserve output fidelity, improving the generated image quality. We furthermore find that these single image generation based methods work better for images with repeated textures compared to general images.
Keywords: unsupervised learning; blind super-resolution; image context; image generation
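The image-gradient-based local prior can be illustrated with a tiny sketch that turns gradient magnitude into a per-pixel weight separating smooth areas from textured ones; the finite-difference gradients and the normalization are illustrative assumptions, not the paper's exact prior.

```python
import numpy as np

def gradient_prior(gray):
    """Per-pixel texture weight from image gradients.

    gray: HxW grayscale image in [0, 1].
    Returns a map near 0 in smooth regions and near 1 in textured regions,
    which could be used to suppress hallucinated detail on smooth areas.
    """
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:] = gray[:, 1:] - gray[:, :-1]   # horizontal finite differences
    gy[1:, :] = gray[1:, :] - gray[:-1, :]   # vertical finite differences
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return mag / (mag.max() + 1e-8)

img = np.random.rand(16, 16)
prior = gradient_prior(img)
print(prior.min(), prior.max())  # values in [0, 1]
```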
Hybrid mesh-neural representation for 3D transparent object reconstruction
Authors: Jiamin Xu, Zihan Zhu, Hujun Bao, Weiwei Xu. Computational Visual Media, 2025, Issue 1, pp. 123-140 (18 pages).
In this study, we propose a novel method to reconstruct the 3D shapes of transparent objects using images captured by handheld cameras under natural lighting conditions. It combines the advantages of an explicit mesh and a multi-layer perceptron (MLP) network as a hybrid representation to simplify the capture settings used in recent studies. After obtaining an initial shape through multi-view silhouettes, we introduced surface-based local MLPs to encode the vertex displacement field (VDF) for reconstructing surface details. The design of local MLPs allowed representation of the VDF in a piecewise manner using two-layer MLP networks to support the optimization algorithm. Defining local MLPs on the surface instead of on the volume also reduced the search space. Such a hybrid representation enabled us to relax the ray–pixel correspondences that represent the light path constraint to our designed ray–cell correspondences, which significantly simplified the implementation of a single-image-based environment-matting algorithm. We evaluated our representation and reconstruction algorithm on several transparent objects based on ground truth models. The experimental results show that our method produces high-quality reconstructions that are superior to those of state-of-the-art methods using a simplified data-acquisition setup.
Keywords: transparent object; 3D reconstruction; environment matting; neural rendering
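To make the vertex displacement field idea concrete, here is a minimal sketch in which a small two-layer MLP per surface patch maps a vertex's position to a displacement along its normal; the patch assignment, layer sizes, and normal-only displacement are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

class LocalVDF:
    """One tiny two-layer MLP per surface patch, encoding vertex displacements."""
    def __init__(self, n_patches, hidden=8):
        self.W1 = rng.standard_normal((n_patches, hidden, 3)) * 0.1
        self.W2 = rng.standard_normal((n_patches, 1, hidden)) * 0.1

    def displace(self, vertices, normals, patch_ids):
        out = vertices.copy()
        for i, (v, n, p) in enumerate(zip(vertices, normals, patch_ids)):
            h = np.tanh(self.W1[p] @ v)   # two-layer MLP on the vertex position
            d = float(self.W2[p] @ h)     # scalar displacement
            out[i] = v + d * n            # move the vertex along its normal
        return out

verts = rng.standard_normal((100, 3))
normals = rng.standard_normal((100, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
patch_ids = rng.integers(0, 10, size=100)   # which local MLP owns each vertex
vdf = LocalVDF(n_patches=10)
print(vdf.displace(verts, normals, patch_ids).shape)  # (100, 3)
```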
FACNet: Feature alignment fast point cloud completion network
Authors: Xinxing Yu, Jianyi Li, Chi-Chong Wong, Chi-Man Vong, Yanyan Liang. Computational Visual Media, 2025, Issue 1, pp. 141-157 (17 pages).
Point cloud completion aims to infer complete point clouds based on partial 3D point cloud inputs. Various previous methods apply coarse-to-fine strategy networks for generating complete point clouds. However, such methods are not only relatively time-consuming but also cannot provide representative complete shape features based on partial inputs. In this paper, a novel feature alignment fast point cloud completion network (FACNet) is proposed to directly and efficiently generate the detailed shapes of objects. FACNet aligns high-dimensional feature distributions of both partial and complete point clouds to maintain global information about the complete shape. During its decoding process, the local features from the partial point cloud are incorporated along with the maintained global information to ensure complete and time-saving generation of the complete point cloud. Experimental results show that FACNet outperforms the state-of-the-art on PCN, Completion3D, and MVP datasets, and achieves competitive performance on ShapeNet-55 and KITTI datasets. Moreover, FACNet and a simplified version, FACNet-slight, achieve a significant speedup of 3–10 times over other state-of-the-art methods.
Keywords: 3D point clouds; shape completion; geometry processing; deep learning
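One way to read "aligning high-dimensional feature distributions" is as a statistics-matching loss between encoder features of the partial and complete clouds; the mean/variance-matching loss below is an assumption for illustration, not FACNet's published objective.

```python
import numpy as np

def feature_alignment_loss(partial_feats, complete_feats):
    """Match first- and second-order statistics of two feature sets.

    partial_feats, complete_feats: (N, C) per-point or per-token features
    taken from the partial input and the complete ground truth respectively.
    """
    mu_p, mu_c = partial_feats.mean(0), complete_feats.mean(0)
    var_p, var_c = partial_feats.var(0), complete_feats.var(0)
    return float(np.mean((mu_p - mu_c) ** 2) + np.mean((var_p - var_c) ** 2))

fp = np.random.randn(256, 128)   # features of the partial input
fc = np.random.randn(512, 128)   # features of the complete ground truth
print(feature_alignment_loss(fp, fc))
```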
Multi-scale enhancement and aggregation network for single-image deraining
Authors: Rui Zhang, Yuetong Liu, Huijian Han, Yong Zheng, Tao Zhang, Yunfeng Zhang. Computational Visual Media, 2025, Issue 1, pp. 213-226 (14 pages).
Rain streaks in an image appear in different sizes and orientations, resulting in severe blurring and visual quality degradation. Previous CNN-based algorithms have achieved encouraging deraining results, although there are certain limitations in the description of rain streaks and the restoration of scene structures in different environments. In this paper, we propose an efficient multi-scale enhancement and aggregation network (MEAN) to solve the single-image deraining problem. Considering the importance of large receptive fields and multi-scale features, we introduce a multi-scale enhanced unit (MEU) to capture long-range dependencies and exploit features at different scales to depict rain. Simultaneously, an attentive aggregation unit (AAU) is designed to utilize the informative features in spatial and channel dimensions, thereby aggregating effective information to eliminate redundant features for rich scenario details. To improve the deraining performance of the encoder–decoder network, we utilized an AAU to filter the information in the encoder network and concatenated the useful features to the decoder network, which is conducive to predicting high-quality clean images. Experimental results on synthetic datasets and real-world samples show that the proposed method achieves a significant deraining performance compared to state-of-the-art approaches.
Keywords: single-image deraining; multi-scale enhancement and aggregation (MEA); encoder-decoder network
Recent advances in 3D Gaussian splatting (Cited by 7)
Authors: Tong Wu, Yu-Jie Yuan, Ling-Xiao Zhang, Jie Yang, Yan-Pei Cao, Ling-Qi Yan, Lin Gao. Computational Visual Media (SCIE, EI, CSCD), 2024, Issue 4, pp. 613-642 (30 pages).
The emergence of 3D Gaussian splatting (3DGS) has greatly accelerated rendering in novel view synthesis. Unlike neural implicit representations like neural radiance fields (NeRFs) that represent a 3D scene with position and viewpoint-conditioned neural networks, 3D Gaussian splatting utilizes a set of Gaussian ellipsoids to model the scene, so that efficient rendering can be accomplished by rasterizing Gaussian ellipsoids into images. Apart from fast rendering, the explicit representation of 3D Gaussian splatting also facilitates downstream tasks like dynamic reconstruction, geometry editing, and physical simulation. Considering the rapid changes and growing number of works in this field, we present a literature review of recent 3D Gaussian splatting methods, which can be roughly classified by functionality into 3D reconstruction, 3D editing, and other downstream applications. Traditional point-based rendering methods and the rendering formulation of 3D Gaussian splatting are also covered to aid understanding of this technique. This survey aims to help beginners to quickly get started in this field and to provide experienced researchers with a comprehensive overview, aiming to stimulate future development of the 3D Gaussian splatting representation.
Keywords: 3D Gaussian splatting (3DGS); radiance field; novel view synthesis; 3D editing; scene generation
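The rendering formulation mentioned in the abstract boils down to front-to-back alpha compositing of the Gaussians that project onto a pixel; the sketch below shows this standard compositing rule, taking per-Gaussian colors and opacities as given (projection and the 2D Gaussian falloff are left out).

```python
import numpy as np

def composite_pixel(colors, alphas, depths):
    """Front-to-back compositing: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j).

    colors: (N, 3) colors of Gaussians overlapping this pixel
    alphas: (N,)   opacities after evaluating the 2D Gaussian falloff
    depths: (N,)   depths used to sort Gaussians front to back
    """
    order = np.argsort(depths)
    pixel = np.zeros(3)
    transmittance = 1.0
    for i in order:
        pixel += transmittance * alphas[i] * colors[i]
        transmittance *= (1.0 - alphas[i])
        if transmittance < 1e-4:      # early termination, as in tile rasterizers
            break
    return pixel

cols = np.random.rand(5, 3)
alph = np.random.rand(5) * 0.8
deps = np.random.rand(5)
print(composite_pixel(cols, alph, deps))
```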
Multi-modal visual tracking: Review and experimental comparison (Cited by 5)
Authors: Pengyu Zhang, Dong Wang, Huchuan Lu. Computational Visual Media (SCIE, EI, CSCD), 2024, Issue 2, pp. 193-214 (22 pages).
Visual object tracking has been drawing increasing attention in recent years, as a fundamental task in computer vision. To extend the range of tracking applications, researchers have been introducing information from multiple modalities to handle specific scenes, with promising research prospects for emerging methods and benchmarks. To provide a thorough review of multi-modal tracking, different aspects of multi-modal tracking algorithms are summarized under a unified taxonomy, with specific focus on visible-depth (RGB-D) and visible-thermal (RGB-T) tracking. Subsequently, a detailed description of the related benchmarks and challenges is provided. Extensive experiments were conducted to analyze the effectiveness of trackers on five datasets: PTB, VOT19-RGBD, GTOT, RGBT234, and VOT19-RGBT. Finally, various future directions, including model design and dataset construction, are discussed from different perspectives for further research.
Keywords: visual tracking; object tracking; multi-modal fusion; RGB-T tracking; RGB-D tracking
Foundation models meet visualizations: Challenges and opportunities (Cited by 2)
Authors: Weikai Yang, Mengchen Liu, Zheng Wang, Shixia Liu. Computational Visual Media (SCIE, EI, CSCD), 2024, Issue 3, pp. 399-424 (26 pages).
Recent studies have indicated that foundation models, such as BERT and GPT, excel at adapting to various downstream tasks. This adaptability has made them a dominant force in building artificial intelligence (AI) systems. Moreover, a new research paradigm has emerged as visualization techniques are incorporated into these models. This study divides these intersections into two research areas: visualization for foundation model (VIS4FM) and foundation model for visualization (FM4VIS). In terms of VIS4FM, we explore the primary role of visualizations in understanding, refining, and evaluating these intricate foundation models. VIS4FM addresses the pressing need for transparency, explainability, fairness, and robustness. Conversely, in terms of FM4VIS, we highlight how foundation models can be used to advance the visualization field itself. The intersection of foundation models with visualizations is promising but also introduces a set of challenges. By highlighting these challenges and promising opportunities, this study aims to provide a starting point for the continued exploration of this research avenue.
Keywords: visualization; artificial intelligence (AI); machine learning; foundation models; visualization for foundation model (VIS4FM); foundation model for visualization (FM4VIS)
Temporally consistent video colorization with deep feature propagation and self-regularization learning (Cited by 2)
Authors: Yihao Liu, Hengyuan Zhao, Kelvin C. K. Chan, Xintao Wang, Chen Change Loy, Yu Qiao, Chao Dong. Computational Visual Media (SCIE, EI, CSCD), 2024, Issue 2, pp. 375-395 (21 pages).
Video colorization is a challenging and highly ill-posed problem. Although recent years have witnessed remarkable progress in single image colorization, there is relatively less research effort on video colorization, and existing methods always suffer from severe flickering artifacts (temporal inconsistency) or unsatisfactory colorization. We address this problem from a new perspective, by jointly considering colorization and temporal consistency in a unified framework. Specifically, we propose a novel temporally consistent video colorization (TCVC) framework. TCVC effectively propagates frame-level deep features in a bidirectional way to enhance the temporal consistency of colorization. Furthermore, TCVC introduces a self-regularization learning (SRL) scheme to minimize the differences in predictions obtained using different time steps. SRL does not require any ground-truth color videos for training and can further improve temporal consistency. Experiments demonstrate that our method can not only provide visually pleasing colorized video, but also achieve clearly better temporal consistency than state-of-the-art methods. A video demo is provided at https://www.youtube.com/watch?v=c7dczMs-olE, while code is available at https://github.com/lyh-18/TCVC-Temporally-Consistent-Video-Colorization.
Keywords: video colorization; temporal consistency; feature propagation; self-regularization
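A minimal sketch of the self-regularization idea: run the same colorization model with two different temporal configurations and penalize disagreement between the two predictions. The colorizer interface, the dummy colorizer, and the L1 penalty below are assumptions for illustration, not TCVC's exact scheme.

```python
import numpy as np

def self_regularization_loss(colorize, gray_frames, step_a=1, step_b=2):
    """Penalize disagreement between predictions made with different time steps.

    colorize(frames, step): hypothetical colorizer returning (T, H, W, 3) output.
    No ground-truth color video is needed; the model regularizes itself.
    """
    pred_a = colorize(gray_frames, step_a)
    pred_b = colorize(gray_frames, step_b)
    return float(np.abs(pred_a - pred_b).mean())

# Stand-in colorizer: tile the gray frame into three channels and average over
# `step` neighboring frames, just so the sketch runs end to end.
def dummy_colorize(frames, step):
    out = np.repeat(frames[..., None], 3, axis=-1)
    for t in range(len(out)):
        lo, hi = max(0, t - step), min(len(out), t + step + 1)
        out[t] = out[lo:hi].mean(axis=0)
    return out

frames = np.random.rand(5, 8, 8)     # T grayscale frames
print(self_regularization_loss(dummy_colorize, frames))
```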
Learning accurate template matching with differentiable coarse-to-fine correspondence refinement (Cited by 2)
Authors: Zhirui Gao, Renjiao Yi, Zheng Qin, Yunfan Ye, Chenyang Zhu, Kai Xu. Computational Visual Media (SCIE, EI, CSCD), 2024, Issue 2, pp. 309-330 (22 pages).
Template matching is a fundamental task in computer vision and has been studied for decades. It plays an essential role in the manufacturing industry for estimating the poses of different parts, facilitating downstream tasks such as robotic grasping. Existing methods fail when the template and source images have different modalities, cluttered backgrounds, or weak textures. They also rarely consider geometric transformations via homographies, which commonly exist even for planar industrial parts. To tackle these challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement. We use an edge-aware module to overcome the domain gap between the mask template and the grayscale image, allowing robust matching. An initial warp is estimated using coarse correspondences based on novel structure-aware information provided by transformers. This initial alignment is passed to a refinement network using references and aligned images to obtain sub-pixel level correspondences, which are used to give the final geometric transformation. Extensive evaluation shows our method to be significantly better than state-of-the-art methods and baselines, providing good generalization ability and visually plausible results even on unseen real data.
Keywords: template matching; differentiable homography; structure-awareness; transformers
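To illustrate how correspondences turn into the final geometric transformation, the sketch below estimates a homography from point matches with the standard Direct Linear Transform; this is the textbook algorithm, not the paper's differentiable refinement network.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct Linear Transform: fit H so that dst ~ H @ src (homogeneous).

    src, dst: (N, 2) matched points, N >= 4 and not all collinear.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)      # null vector of A, up to scale
    return H / H[2, 2]

def warp_points(H, pts):
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

# Synthetic check: points related by a known homography are recovered.
H_true = np.array([[1.1, 0.02, 5.0], [0.01, 0.95, -3.0], [1e-4, 2e-4, 1.0]])
src = np.random.rand(20, 2) * 100
dst = warp_points(H_true, src)
H_est = estimate_homography(src, dst)
print(np.abs(H_est - H_true).max())   # close to zero for noise-free matches
```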
6DOF pose estimation of a 3D rigid object based on edge-enhanced point pair features (Cited by 2)
Authors: Chenyi Liu, Fei Chen, Lu Deng, Renjiao Yi, Lintao Zheng, Chenyang Zhu, Jia Wang, Kai Xu. Computational Visual Media (SCIE, EI, CSCD), 2024, Issue 1, pp. 61-77 (17 pages).
The point pair feature (PPF) is widely used for 6D pose estimation. In this paper, we propose an efficient 6D pose estimation method based on the PPF framework. We introduce a well-targeted down-sampling strategy that focuses on edge areas for efficient feature extraction for complex geometry. A pose hypothesis validation approach is proposed to resolve ambiguity due to symmetry by calculating the edge matching degree. We perform evaluations on two challenging datasets and one real-world collected dataset, demonstrating the superiority of our method for pose estimation for geometrically complex, occluded, symmetrical objects. We further validate our method by applying it to simulated punctures.
Keywords: point pair feature (PPF); pose estimation; object recognition; 3D point cloud
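For reference, the classical point pair feature that this framework builds on describes an oriented point pair by one distance and three angles; the sketch below computes it for a single pair (this is the standard definition, not the paper's edge-enhanced variant).

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Classical PPF: F = (||d||, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    def angle(a, b):
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    d = p2 - p1
    return np.array([np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2)])

p1, n1 = np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
p2, n2 = np.array([0.1, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(point_pair_feature(p1, n1, p2, n2))
# In a PPF pipeline, these 4D features are quantized and hashed over the model,
# then matched against scene pairs to vote for pose hypotheses.
```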
SAM-driven MAE pre-training and background-aware meta-learning for unsupervised vehicle re-identification (Cited by 1)
Authors: Dong Wang, Qi Wang, Weidong Min, Di Gai, Qing Han, Longfei Li, Yuhan Geng. Computational Visual Media (SCIE, EI, CSCD), 2024, Issue 4, pp. 771-789 (19 pages).
Distinguishing identity-unrelated background information from discriminative identity information poses a challenge in unsupervised vehicle re-identification (Re-ID). Re-ID models suffer from varying degrees of background interference caused by continuous scene variations. The recently proposed segment anything model (SAM) has demonstrated exceptional performance in zero-shot segmentation tasks. The combination of SAM and vehicle Re-ID models can achieve efficient separation of vehicle identity and background information. This paper proposes a method that combines SAM-driven mask autoencoder (MAE) pre-training and background-aware meta-learning for unsupervised vehicle Re-ID. The method consists of three sub-modules. First, the segmentation capacity of SAM is utilized to separate the vehicle identity region from the background. SAM cannot be robustly employed in exceptional situations, such as those with ambiguity or occlusion. Thus, in vehicle Re-ID downstream tasks, a spatially-constrained vehicle background segmentation method is presented to obtain accurate background segmentation results. Second, SAM-driven MAE pre-training utilizes the aforementioned segmentation results to select patches belonging to the vehicle and to mask other patches, allowing MAE to learn identity-sensitive features in a self-supervised manner. Finally, we present a background-aware meta-learning method to fit varying degrees of background interference in different scenarios by combining different background region ratios. Our experiments demonstrate that the proposed method has state-of-the-art performance in reducing background interference variations.
Keywords: unsupervised; re-identification (Re-ID); vehicles; segmentation; autoencoder; meta-learning
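The patch selection step described in the abstract, keeping patches that overlap the segmented vehicle and masking the rest, can be sketched as below; the patch size and foreground-ratio threshold are illustrative assumptions.

```python
import numpy as np

def select_vehicle_patches(seg_mask, patch=16, fg_thresh=0.5):
    """Decide, per ViT-style patch, whether it belongs to the vehicle.

    seg_mask: HxW binary mask from segmentation (1 = vehicle).
    Returns a boolean grid; True patches would be kept as MAE targets,
    False patches would be masked out during pre-training.
    """
    H, W = seg_mask.shape
    gh, gw = H // patch, W // patch
    grid = seg_mask[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    fg_ratio = grid.mean(axis=(1, 3))     # fraction of vehicle pixels per patch
    return fg_ratio >= fg_thresh

mask = np.zeros((224, 224))
mask[60:180, 40:200] = 1                  # pretend the segmenter found this region
keep = select_vehicle_patches(mask)
print(keep.shape, int(keep.sum()))        # (14, 14) grid and number of kept patches
```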
CF-DAN: Facial-expression recognition based on cross-fusion dual-attention network (Cited by 1)
Authors: Fan Zhang, Gongguan Chen, Hua Wang, Caiming Zhang. Computational Visual Media (SCIE, EI, CSCD), 2024, Issue 3, pp. 593-608 (16 pages).
Recently, facial-expression recognition (FER) has primarily focused on images in the wild, including factors such as face occlusion and image blurring, rather than laboratory images. Complex field environments have introduced new challenges to FER. To address these challenges, this study proposes a cross-fusion dual-attention network. The network comprises three parts: (1) a cross-fusion grouped dual-attention mechanism to refine local features and obtain global information; (2) a proposed C2 activation function construction method, which is a piecewise cubic polynomial with three degrees of freedom, requiring less computation with improved flexibility and recognition abilities, which can better address slow running speeds and neuron inactivation problems; and (3) a closed-loop operation between the self-attention distillation process and residual connections to suppress redundant information and improve the generalization ability of the model. The recognition accuracies on the RAF-DB, FERPlus, and AffectNet datasets were 92.78%, 92.02%, and 63.58%, respectively. Experiments show that this model can provide more effective solutions for FER tasks.
Keywords: facial-expression recognition (FER); cubic polynomial activation function; dual-attention mechanism; interactive learning; self-attention distillation
Multi-granularity sequence generation for hierarchical image classification (Cited by 1)
Authors: Xinda Liu, Lili Wang. Computational Visual Media (SCIE, EI, CSCD), 2024, Issue 2, pp. 243-260 (18 pages).
Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with multiple granularity labels simultaneously. Existing methods tend to overlook that different image regions contribute differently to label prediction at different granularities, and also insufficiently consider relationships between the hierarchical multi-granularity labels. We introduce a sequence-to-sequence mechanism to overcome these two problems and propose a multi-granularity sequence generation (MGSG) approach for the hierarchical multi-granularity image classification task. Specifically, we introduce a transformer architecture to encode the image into visual representation sequences. Next, we traverse the taxonomic tree and organize the multi-granularity labels into sequences, and vectorize them and add positional information. The proposed multi-granularity sequence generation method builds a decoder that takes visual representation sequences and semantic label embeddings as inputs, and outputs the predicted multi-granularity label sequence. The decoder models dependencies and correlations between multi-granularity labels through a masked multi-head self-attention mechanism, and relates visual information to the semantic label information through a cross-modality attention mechanism. In this way, the proposed method preserves the relationships between labels at different granularity levels and takes into account the influence of different image regions on labels with different granularities. Evaluations on six public benchmarks qualitatively and quantitatively demonstrate the advantages of the proposed method. Our project is available at https://github.com/liuxindazz/mgs.
Keywords: hierarchical multi-granularity classification; vision and text transformer; sequence generation; fine-grained image recognition; cross-modality attention
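The step of organizing the hierarchical labels into a sequence can be sketched by walking a taxonomic tree from the root to a leaf, producing one label per granularity level; the toy taxonomy and the parent-map representation are assumptions for illustration.

```python
def label_sequence(leaf, parent):
    """Walk from a leaf label up to the root, then reverse to get the
    coarse-to-fine label sequence a decoder would be trained to emit."""
    seq = [leaf]
    while seq[-1] in parent:
        seq.append(parent[seq[-1]])
    return list(reversed(seq))

# Hypothetical 3-level taxonomy: order -> family -> species.
parent = {
    "laysan_albatross": "albatross",
    "sooty_albatross": "albatross",
    "albatross": "seabird",
    "cardinal": "songbird",
}
print(label_sequence("laysan_albatross", parent))
# ['seabird', 'albatross', 'laysan_albatross']
```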