Visible and infrared image fusion (VIF) aims to combine information from visible and infrared images into a single fused image. Previous VIF methods usually employ a color space transformation to keep the hue and saturation from the original visible image. However, for fast VIF methods, this operation accounts for the majority of the calculation and is the bottleneck preventing faster processing. In this paper, we propose a fast fusion method, FCDFusion, with little color deviation. It preserves color information without color space transformations, by directly operating in RGB color space. It incorporates gamma correction at little extra cost, allowing color and contrast to be rapidly improved. We regard the fusion process as a scaling operation on 3D color vectors, greatly simplifying the calculations. A theoretical analysis and experiments show that our method can achieve satisfactory results in only 7 FLOPs per pixel. Compared to state-of-the-art fast, color-preserving methods using HSV color space, our method provides higher contrast at only half of the computational cost. We further propose a new metric, color deviation, to measure the ability of a VIF method to preserve color. It is specifically designed for VIF tasks with color visible-light images, and overcomes deficiencies of existing VIF metrics used for this purpose. Our code is available at https://github.com/HeasonLee/FCDFusion.
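As an illustration of the idea above, the sketch below treats fusion as a per-pixel scaling of the visible RGB vector and measures color deviation as the angle between visible and fused color vectors. The scaling rule, the 50/50 intensity blend, and the metric definition are assumptions made for this example, not the exact FCDFusion formulas.

```python
import numpy as np

def fuse_rgb_ir(vis_rgb: np.ndarray, ir: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Toy visible/infrared fusion by scaling each visible RGB vector.

    vis_rgb: (H, W, 3) visible image in [0, 1]; ir: (H, W) infrared in [0, 1].
    The scale rule and gamma handling are illustrative assumptions.
    """
    vis_y = vis_rgb.mean(axis=-1, keepdims=True)      # visible brightness
    target_y = 0.5 * vis_y + 0.5 * ir[..., None]      # blended target brightness
    scale = (target_y + 1e-6) / (vis_y + 1e-6)
    # One shared factor per pixel scales all three channels, so hue is unchanged.
    fused = np.clip(vis_rgb * scale, 0.0, 1.0)
    return fused ** (1.0 / gamma)                     # cheap gamma correction

def color_deviation(vis_rgb: np.ndarray, fused: np.ndarray) -> float:
    """Mean angle (radians) between visible and fused RGB vectors; a
    hue-preserving fusion keeps the vectors nearly parallel (small angle)."""
    a = vis_rgb.reshape(-1, 3) + 1e-6
    b = fused.reshape(-1, 3) + 1e-6
    cos = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)).mean())
```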
The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validated the scalability, generality, and superior performance enabled by our approach.
Inspired by Minsky’s Society of Mind, Schmidhuber’s Learning to Think, and other more recent works [9–16], this paper proposes and advocates for the concept of natural language-based societies of mind (NLSOMs). We imagine these societies as consisting of a collection of multimodal neural networks, including large language models, which engage in a “mindstorm” to solve problems using a shared natural language interface. Here, we work to identify and discuss key questions about the social structure, governance, and economic principles for NLSOMs, emphasizing their impact on the future of AI. Our demonstrations with NLSOMs, which feature up to 129 agents, show their effectiveness in various tasks, including visual question answering, image captioning, and prompt generation for text-to-image synthesis.
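A minimal sketch of the shared natural-language interface is given below: each agent reads a common transcript and appends a textual reply over several rounds. The Agent protocol, the round structure, and the stub agents are illustrative assumptions, not the NLSOM implementation.

```python
from typing import Callable, List

# An "agent" is anything that maps the shared transcript to a text reply;
# in practice each agent could wrap an LLM, a captioner, or a VQA model.
Agent = Callable[[List[str]], str]

def mindstorm(agents: List[Agent], task: str, rounds: int = 3) -> List[str]:
    """Toy natural-language mindstorm: every agent reads the shared transcript
    and appends its reply, for a fixed number of rounds."""
    transcript = [f"TASK: {task}"]
    for _ in range(rounds):
        for agent in agents:
            transcript.append(agent(transcript))
    return transcript

# Two stub agents standing in for multimodal neural networks.
proposer = lambda t: f"Proposal based on: {t[-1][:40]}"
critic = lambda t: f"Critique of: {t[-1][:40]}"
print("\n".join(mindstorm([proposer, critic], "caption the image")))
```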
Language-guided fashion image editing is challenging, as fashion image editing is local and requires high precision, while natural language cannot provide precise visual information for guidance. In this paper, we propose LucIE, a novel unsupervised language-guided local image editing method for fashion images. LucIE adopts and modifies a recent text-to-image synthesis network, DF-GAN, as its backbone. However, the synthesis backbone often changes the global structure of the input image, making local image editing impractical. To increase structural consistency between input and edited images, we propose the Content-Preserving Fusion Module (CPFM). Different from existing fusion modules, CPFM prevents iterative refinement on visual feature maps and accumulates additive modifications on RGB maps. LucIE achieves local image editing explicitly with language-guided image segmentation and mask-guided image blending, while only using image and text pairs. Results on the DeepFashion dataset show that LucIE achieves state-of-the-art results. Compared with previous methods, images generated by LucIE also exhibit fewer artifacts. We provide visualizations and perform ablation studies to validate LucIE and the CPFM. We also demonstrate and analyze limitations of LucIE to provide a better understanding of the method.
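The final compositing step, language-guided segmentation followed by mask-guided image blending, can be written compactly as below; the array shapes and value ranges are assumptions for this sketch.

```python
import numpy as np

def mask_guided_blend(original: np.ndarray, edited: np.ndarray,
                      mask: np.ndarray) -> np.ndarray:
    """Blend an edited RGB image into the original only inside the mask.

    original, edited: (H, W, 3) float arrays in [0, 1]
    mask: (H, W) float array in [0, 1], e.g., produced by a language-guided
          segmentation network (1 = region described by the text).
    """
    m = mask[..., None]  # broadcast over the color channels
    # Pixels outside the mask come from the input image, so the global
    # structure of the original is preserved by construction.
    return m * edited + (1.0 - m) * original
```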
Storyboards comprising key illustrations and images help filmmakers to outline ideas, key moments, and story events when filming movies. Inspired by this, we introduce the first contextual benchmark dataset, Script-to-Storyboard (Sc2St), composed of storyboards to explicitly express story structures in the movie domain, and propose the contextual retrieval task to facilitate movie story understanding. The Sc2St dataset contains fine-grained and diverse texts, annotated semantic keyframes, and coherent storylines in storyboards, unlike existing movie datasets. The contextual retrieval task takes as input a multi-sentence movie script summary with keyframe history and aims to retrieve a future keyframe described by a corresponding sentence to form the storyboard. Compared to classic text-based visual retrieval tasks, this requires capturing the context from the description (script) and keyframe history. We benchmark existing text-based visual retrieval methods on the new dataset and propose a recurrent-based framework with three variants for effective context encoding. Comprehensive experiments demonstrate that our methods compare favourably to existing methods; ablation studies validate the effectiveness of the proposed context encoding approaches.
Few-shot classification models trained with clean samples poorly classify samples from the real world with various scales of noise. To enhance the model for recognizing noisy samples, researchers usually utilize data augmentation or use noisy samples generated by adversarial training for model training. However, existing methods still have problems: (i) the effects of data augmentation on the robustness of the model are limited; (ii) the noise generated by adversarial training usually causes overfitting and reduces the generalization ability of the model, which is especially harmful for few-shot classification; and (iii) most existing methods cannot adaptively generate appropriate noise. Given these three points, this paper proposes a noise-robust few-shot classification algorithm, VADA (Variational Adversarial Data Augmentation). Unlike existing methods, VADA utilizes a variational noise generator to generate an adaptive noise distribution according to different samples based on adversarial learning, and optimizes the generator by minimizing the expectation of the empirical risk. Applying VADA during training can make few-shot classification more robust against noisy data, while retaining generalization ability. In this paper, we utilize FEAT and ProtoNet as baseline models, and accuracy is verified on several common few-shot classification datasets, including MiniImageNet, TieredImageNet, and CUB. After training with VADA, the classification accuracy of the models increases for samples with various scales of noise.
Denoising diffusion models have demonstrated tremendous success in modeling data distributions and synthesizing high-quality samples. In the 2D image domain, they have become the state-of-the-art and are capable of generating photo-realistic images with high controllability. More recently, researchers have begun to explore how to utilize diffusion models to generate 3D data, as doing so has more potential in real-world applications. This requires careful design choices in two key ways: identifying a suitable 3D representation and determining how to apply the diffusion process. In this survey, we provide the first comprehensive review of diffusion models for manipulating 3D content, including 3D generation, reconstruction, and 3D-aware image synthesis. We classify existing methods into three major categories: 2D space diffusion with pretrained models, 2D space diffusion without pretrained models, and 3D space diffusion. We also summarize popular datasets used for 3D generation with diffusion models. Along with this survey, we maintain a repository, https://github.com/cwchenwang/awesome-3d-diffusion, to track the latest relevant papers and codebases. Finally, we pose current challenges for diffusion models for 3D generation, and suggest future research directions.
Real-world blind image super-resolution is a challenging problem due to the absence of target high-resolution images for training. Inspired by the recent success of the single image generation based method SinGAN, we tackle this challenging problem with a refined model, SR-SinGAN, which can learn to perform single real image super-resolution. Firstly, we empirically find that a downsampled LR input with an appropriate size can improve the robustness of the generation model. Secondly, we introduce a global contextual prior to provide semantic information. This helps to remove distorted pixels and improve the output fidelity. Finally, we design an image gradient based local contextual prior to guide detail generation. It can alleviate generated artifacts in smooth areas while preserving rich details in densely textured regions (e.g., hair, grass). To evaluate the effectiveness of these contextual priors, we conducted extensive experiments on both artificial and real images. Results show that these priors can stabilize training and preserve output fidelity, improving the generated image quality. We furthermore find that these single image generation based methods work better for images with repeated textures compared to general images.
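One plausible way to realize an image-gradient-based local contextual prior is to build a per-pixel texture map from gradient magnitude and use it to gate detail synthesis; the normalization below is an illustrative assumption rather than the paper's exact formulation.

```python
import numpy as np

def gradient_prior(gray: np.ndarray) -> np.ndarray:
    """Per-pixel 'texture density' map in [0, 1] from gradient magnitude.

    Smooth areas map to values near 0 (suppress generated detail and
    artifacts); densely textured regions such as hair or grass map to values
    near 1 (allow rich detail). The max-normalization is an assumption.
    """
    gy, gx = np.gradient(gray.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    return magnitude / (magnitude.max() + 1e-8)
```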
In this study, we propose a novel method to reconstruct the 3D shapes of transparent objects using images captured by handheld cameras under natural lighting conditions. It combines the advantages of an explicit mesh and a multi-layer perceptron (MLP) network as a hybrid representation to simplify the capture settings used in recent studies. After obtaining an initial shape through multi-view silhouettes, we introduce surface-based local MLPs to encode the vertex displacement field (VDF) for reconstructing surface details. The design of local MLPs allows representation of the VDF in a piecewise manner using two-layer MLP networks to support the optimization algorithm. Defining local MLPs on the surface instead of on the volume also reduces the search space. Such a hybrid representation enables us to relax the ray–pixel correspondences that represent the light path constraint to our designed ray–cell correspondences, which significantly simplifies the implementation of a single-image-based environment-matting algorithm. We evaluated our representation and reconstruction algorithm on several transparent objects based on ground truth models. The experimental results show that our method produces high-quality reconstructions that are superior to those of state-of-the-art methods using a simplified data-acquisition setup.
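The surface-based local MLPs can be pictured as below: each patch owns a tiny two-layer network that maps a local point to a scalar displacement, so the vertex displacement field is represented piecewise. The layer sizes, activation, and random initialization are assumptions for illustration only.

```python
import numpy as np

def make_local_mlp(in_dim: int = 3, hidden: int = 32, seed: int = 0):
    """Build a tiny two-layer MLP mapping a local surface point to a scalar
    displacement along its normal. Sizes, ReLU, and the random init are
    illustrative assumptions, not the paper's exact configuration."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.1, (hidden, 1))
    b2 = np.zeros(1)

    def mlp(x: np.ndarray) -> np.ndarray:
        h = np.maximum(x @ w1 + b1, 0.0)   # ReLU hidden layer
        return (h @ w2 + b2)[..., 0]       # scalar displacement per point
    return mlp

# One small MLP per surface patch keeps the vertex displacement field (VDF)
# piecewise and the optimization local to the surface.
vdf_patch = make_local_mlp()
displacements = vdf_patch(np.random.rand(100, 3))  # displacements for 100 points
```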
Point cloud completion aims to infer complete point clouds based on partial 3D point cloud inputs. Various previous methods apply coarse-to-fine strategy networks for generating complete point clouds. However, such methods are not only relatively time-consuming but also cannot provide representative complete shape features based on partial inputs. In this paper, a novel feature alignment fast point cloud completion network (FACNet) is proposed to directly and efficiently generate the detailed shapes of objects. FACNet aligns high-dimensional feature distributions of both partial and complete point clouds to maintain global information about the complete shape. During its decoding process, the local features from the partial point cloud are incorporated along with the maintained global information to ensure complete and time-saving generation of the complete point cloud. Experimental results show that FACNet outperforms the state-of-the-art on PCN, Completion3D, and MVP datasets, and achieves competitive performance on ShapeNet-55 and KITTI datasets. Moreover, FACNet and a simplified version, FACNet-slight, achieve a significant speedup of 3–10 times over other state-of-the-art methods.
Rain streaks in an image appear in different sizes and orientations, resulting in severe blurring and visual quality degradation. Previous CNN-based algorithms have achieved encouraging deraining results, although there are certain limitations in the description of rain streaks and the restoration of scene structures in different environments. In this paper, we propose an efficient multi-scale enhancement and aggregation network (MEAN) to solve the single-image deraining problem. Considering the importance of large receptive fields and multi-scale features, we introduce a multi-scale enhanced unit (MEU) to capture long-range dependencies and exploit features at different scales to depict rain. Simultaneously, an attentive aggregation unit (AAU) is designed to utilize the informative features in spatial and channel dimensions, thereby aggregating effective information to eliminate redundant features for rich scenario details. To improve the deraining performance of the encoder–decoder network, we utilize an AAU to filter the information in the encoder network and concatenate the useful features to the decoder network, which is conducive to predicting high-quality clean images. Experimental results on synthetic datasets and real-world samples show that the proposed method achieves significant deraining performance compared to state-of-the-art approaches.
The emergence of 3D Gaussian splatting (3DGS) has greatly accelerated rendering in novel view synthesis. Unlike neural implicit representations such as neural radiance fields (NeRFs), which represent a 3D scene with position- and viewpoint-conditioned neural networks, 3D Gaussian splatting utilizes a set of Gaussian ellipsoids to model the scene, so that efficient rendering can be accomplished by rasterizing the Gaussian ellipsoids into images. Apart from fast rendering, the explicit representation of 3D Gaussian splatting also facilitates downstream tasks like dynamic reconstruction, geometry editing, and physical simulation. Considering the rapid changes and growing number of works in this field, we present a literature review of recent 3D Gaussian splatting methods, which can be roughly classified by functionality into 3D reconstruction, 3D editing, and other downstream applications. Traditional point-based rendering methods and the rendering formulation of 3D Gaussian splatting are also covered to aid understanding of this technique. This survey aims to help beginners quickly get started in this field, to provide experienced researchers with a comprehensive overview, and to stimulate future development of the 3D Gaussian splatting representation.
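For readers new to the rendering formulation, the core of Gaussian splatting's image formation is front-to-back alpha compositing of the depth-sorted Gaussians covering each pixel. The per-pixel loop below shows that accumulation rule; tile-based rasterization and the 2D projection of each Gaussian are omitted.

```python
import numpy as np

def composite_pixel(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back alpha compositing for one pixel:
        C = sum_i c_i * alpha_i * prod_{j < i} (1 - alpha_j)
    colors: (N, 3) colors of the depth-sorted Gaussians covering the pixel
    alphas: (N,)   their opacities after evaluating the projected 2D Gaussian
    """
    pixel = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * c
        transmittance *= 1.0 - a
        if transmittance < 1e-4:  # early termination once the pixel is opaque
            break
    return pixel
```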
Visual object tracking has been drawing increasing attention in recent years, as a fundamental task in computer vision. To extend the range of tracking applications, researchers have been introducing information from multiple modalities to handle specific scenes, with promising research prospects for emerging methods and benchmarks. To provide a thorough review of multi-modal tracking, different aspects of multi-modal tracking algorithms are summarized under a unified taxonomy, with specific focus on visible-depth (RGB-D) and visible-thermal (RGB-T) tracking. Subsequently, a detailed description of the related benchmarks and challenges is provided. Extensive experiments were conducted to analyze the effectiveness of trackers on five datasets: PTB, VOT19-RGBD, GTOT, RGBT234, and VOT19-RGBT. Finally, various future directions, including model design and dataset construction, are discussed from different perspectives for further research.
Recent studies have indicated that foundation models, such as BERT and GPT, excel at adapting to various downstream tasks. This adaptability has made them a dominant force in building artificial intelligence (AI) systems. Moreover, a new research paradigm has emerged as visualization techniques are incorporated into these models. This study divides these intersections into two research areas: visualization for foundation model (VIS4FM) and foundation model for visualization (FM4VIS). In terms of VIS4FM, we explore the primary role of visualizations in understanding, refining, and evaluating these intricate foundation models. VIS4FM addresses the pressing need for transparency, explainability, fairness, and robustness. Conversely, in terms of FM4VIS, we highlight how foundation models can be used to advance the visualization field itself. The intersection of foundation models with visualizations is promising but also introduces a set of challenges. By highlighting these challenges and promising opportunities, this study aims to provide a starting point for the continued exploration of this research avenue.
Video colorization is a challenging and highly ill-posed problem. Although recent years have witnessed remarkable progress in single image colorization, there has been relatively less research effort on video colorization, and existing methods always suffer from severe flickering artifacts (temporal inconsistency) or unsatisfactory colorization. We address this problem from a new perspective, by jointly considering colorization and temporal consistency in a unified framework. Specifically, we propose a novel temporally consistent video colorization (TCVC) framework. TCVC effectively propagates frame-level deep features in a bidirectional way to enhance the temporal consistency of colorization. Furthermore, TCVC introduces a self-regularization learning (SRL) scheme to minimize the differences in predictions obtained using different time steps. SRL does not require any ground-truth color videos for training and can further improve temporal consistency. Experiments demonstrate that our method can not only provide visually pleasing colorized video, but also do so with clearly better temporal consistency than state-of-the-art methods. A video demo is provided at https://www.youtube.com/watch?v=c7dczMs-olE, while code is available at https://github.com/lyh-18/TCVC-Temporally-Consistent-Video-Colorization.
Template matching is a fundamental task in computer vision and has been studied for decades. It plays an essential role in the manufacturing industry for estimating the poses of different parts, facilitating downstream tasks such as robotic grasping. Existing methods fail when the template and source images have different modalities, cluttered backgrounds, or weak textures. They also rarely consider geometric transformations via homographies, which commonly exist even for planar industrial parts. To tackle these challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement. We use an edge-aware module to overcome the domain gap between the mask template and the grayscale image, allowing robust matching. An initial warp is estimated using coarse correspondences based on novel structure-aware information provided by transformers. This initial alignment is passed to a refinement network that uses reference and aligned images to obtain sub-pixel level correspondences, which are used to give the final geometric transformation. Extensive evaluation shows our method to be significantly better than state-of-the-art methods and baselines, providing good generalization ability and visually plausible results even on unseen real data.
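Turning refined correspondences into the final geometric transformation is a standard homography fit; the sketch below uses OpenCV's RANSAC-based estimator as a stand-in and is not the paper's differentiable refinement network.

```python
import numpy as np
import cv2

def homography_from_matches(template_pts: np.ndarray,
                            image_pts: np.ndarray) -> np.ndarray:
    """Fit a 3x3 homography from matched template/image points, shape (N, 2).

    In a coarse-to-fine pipeline, coarse matches give an initial warp and the
    refined sub-pixel matches are fed to the same fit for the final transform.
    """
    H, _ = cv2.findHomography(template_pts.astype(np.float32),
                              image_pts.astype(np.float32),
                              cv2.RANSAC, ransacReprojThreshold=3.0)
    return H

# Usage: map the template's corner points into the source image.
# corners has shape (1, 4, 2); cv2.perspectiveTransform applies H to each point.
# warped = cv2.perspectiveTransform(corners, H)
```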
The point pair feature (PPF) is widely used for 6D pose estimation. In this paper, we propose an efficient 6D pose estimation method based on the PPF framework. We introduce a well-targeted down-sampling strategy that focuses on edge areas for efficient feature extraction for complex geometry. A pose hypothesis validation approach is proposed to resolve ambiguity due to symmetry by calculating the edge matching degree. We perform evaluations on two challenging datasets and one real-world collected dataset, demonstrating the superiority of our method for pose estimation for geometrically complex, occluded, symmetrical objects. We further validate our method by applying it to simulated punctures.
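For reference, the classic four-dimensional point pair feature on which such frameworks build can be computed as follows; the edge-focused sampling and hypothesis validation described above are not reproduced here.

```python
import numpy as np

def _angle(a: np.ndarray, b: np.ndarray) -> float:
    """Unsigned angle between two 3D vectors, in [0, pi]."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def point_pair_feature(p1, n1, p2, n2):
    """Classic PPF: F(p1, p2) = (||d||, ang(n1, d), ang(n2, d), ang(n1, n2)),
    where d = p2 - p1, p* are 3D points and n* their unit normals."""
    p1, n1, p2, n2 = (np.asarray(v, dtype=float) for v in (p1, n1, p2, n2))
    d = p2 - p1
    return (float(np.linalg.norm(d)),
            _angle(n1, d), _angle(n2, d), _angle(n1, n2))
```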
Distinguishing identity-unrelated background information from discriminative identity information poses a challenge in unsupervised vehicle re-identification (Re-ID). Re-ID models suffer from varying degrees of background interference caused by continuous scene variations. The recently proposed segment anything model (SAM) has demonstrated exceptional performance in zero-shot segmentation tasks. The combination of SAM and vehicle Re-ID models can achieve efficient separation of vehicle identity and background information. This paper proposes a method that combines SAM-driven mask autoencoder (MAE) pre-training and background-aware meta-learning for unsupervised vehicle Re-ID. The method consists of three sub-modules. First, the segmentation capacity of SAM is utilized to separate the vehicle identity region from the background. SAM cannot be robustly employed in exceptional situations, such as those with ambiguity or occlusion. Thus, in vehicle Re-ID downstream tasks, a spatially-constrained vehicle background segmentation method is presented to obtain accurate background segmentation results. Second, SAM-driven MAE pre-training utilizes the aforementioned segmentation results to select patches belonging to the vehicle and to mask other patches, allowing MAE to learn identity-sensitive features in a self-supervised manner. Finally, we present a background-aware meta-learning method to fit varying degrees of background interference in different scenarios by combining different background region ratios. Our experiments demonstrate that the proposed method has state-of-the-art performance in reducing background interference variations.
Recently, facial-expression recognition (FER) has primarily focused on images in the wild, including factors such as face occlusion and image blurring, rather than laboratory images. Complex field environments have introduced new challenges to FER. To address these challenges, this study proposes a cross-fusion dual-attention network. The network comprises three parts: (1) a cross-fusion grouped dual-attention mechanism to refine local features and obtain global information; (2) a proposed C2 activation function construction method, which is a piecewise cubic polynomial with three degrees of freedom, requiring less computation with improved flexibility and recognition abilities, which can better address slow running speeds and neuron inactivation problems; and (3) a closed-loop operation between the self-attention distillation process and residual connections to suppress redundant information and improve the generalization ability of the model. The recognition accuracies on the RAF-DB, FERPlus, and AffectNet datasets were 92.78%, 92.02%, and 63.58%, respectively. Experiments show that this model can provide more effective solutions for FER tasks.
Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with multiple granularity labels simultaneously. Existing methods tend to overlook that different image regions contribute differently to label prediction at different granularities, and also insufficiently consider relationships between the hierarchical multi-granularity labels. We introduce a sequence-to-sequence mechanism to overcome these two problems and propose a multi-granularity sequence generation (MGSG) approach for the hierarchical multi-granularity image classification task. Specifically, we introduce a transformer architecture to encode the image into visual representation sequences. Next, we traverse the taxonomic tree and organize the multi-granularity labels into sequences, vectorize them, and add positional information. The proposed multi-granularity sequence generation method builds a decoder that takes visual representation sequences and semantic label embeddings as inputs, and outputs the predicted multi-granularity label sequence. The decoder models dependencies and correlations between multi-granularity labels through a masked multi-head self-attention mechanism, and relates visual information to the semantic label information through a cross-modality attention mechanism. In this way, the proposed method preserves the relationships between labels at different granularity levels and takes into account the influence of different image regions on labels with different granularities. Evaluations on six public benchmarks qualitatively and quantitatively demonstrate the advantages of the proposed method. Our project is available at https://github.com/liuxindazz/mgs.
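The decoder described above, masked self-attention over the label sequence plus cross-modality attention to visual tokens, has the structure of a standard transformer decoder. The sketch below uses PyTorch's built-in module with illustrative sizes and omits positional encodings; it is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

d_model, num_labels, seq_len = 256, 100, 4      # illustrative sizes

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
label_embed = nn.Embedding(num_labels, d_model)
classifier = nn.Linear(d_model, num_labels)

visual_tokens = torch.randn(1, 49, d_model)            # encoded image patches (memory)
labels = torch.randint(0, num_labels, (1, seq_len))    # coarse-to-fine label sequence
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

# Masked self-attention over label embeddings, cross-attention to visual tokens.
hidden = decoder(label_embed(labels), visual_tokens, tgt_mask=causal_mask)
logits = classifier(hidden)                            # (1, seq_len, num_labels)
```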
Funding information for the articles above:
FCDFusion: supported by the National Natural Science Foundation of China under Grant Nos. 62171038, 61827901, and 62088101.
NLSOM: supported by the European Research Council (ERC, Advanced Grant Number 742870), the Swiss National Science Foundation (SNF, Grant Numbers 200021 and 192356), and the National Natural Science Foundation of China (Grant Number 62476143).
Script-to-Storyboard (Sc2St): supported by RCUK grant CAMERA (EP/M023281/1, EP/T022523/1), the Centre for Augmented Reasoning (CAR) at the Australian Institute for Machine Learning, and a gift from Adobe.
VADA: supported by Yunnan Provincial Major Science and Technology Special Plan Projects (202202AD080003) and the Natural Science Foundation of Shandong Province (ZR2022MD090).
Transparent object reconstruction: supported by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang (No. 2023C01181), the National Natural Science Foundation of China (No. 62302134), the Zhejiang Provincial Natural Science Foundation (No. LQ24F020031), and the Information Technology Center and State Key Lab of CAD&CG, Zhejiang University.
FACNet: supported by the Zhuhai Industry-University-Research Project (No. 2220004002411), the National Key R&D Program of China (No. 2021YFE0205700), the Science and Technology Development Fund of Macao (Nos. 0070/2020/AMJ, 00123/2022/A3, and 0096/2023/RIA2), the Zhuhai City Polytechnic Research Project (No. 2024KYBS02), the Shenzhen Science and Technology Innovation Committee (No. SGDX20220530111001006), and the University of Macao under Grants MYRG (Nos. GRG2023-00061-FST UMDF and 2022-00084-FST).
MEAN: supported by the National Natural Science Foundation of China (No. 61972227), the Natural Science Foundation of Shandong Province (No. ZR201808160102), the Shandong Provincial Natural Science Foundation Key Project (No. ZR2020KF015), the Key Research and Development Project of Shandong Province (No. 2019GSF109112), the Science and Technology Plan for Young Talents in Colleges and Universities of Shandong Province (No. 2020KJN007), the Scientific Research Studio in Colleges and Universities of Ji'nan City (No. 2021GXRC092), and the Science and Technology Research Program for Colleges and Universities in Shandong Province (No. KJ2018BZN029).
3D Gaussian splatting survey: supported by the National Natural Science Foundation of China (62322210), the Beijing Municipal Natural Science Foundation for Distinguished Young Scholars (JQ21013), the Beijing Municipal Science and Technology Commission (Z231100005923031), and the 2023 Tencent AI Lab Rhino-Bird Focused Research Program.
Multi-modal tracking survey: supported in part by the National Natural Science Foundation of China (Nos. U23A20384 and 62022021), the Joint Fund of Ministry of Education for Equipment Pre-research (No. 8091B032155), the National Defense Basic Scientific Research Program (No. WDZC20215250205), and the Central Guidance on Local Science and Technology Development Fund of Liaoning Province (No. 2022JH6/100100026).
VIS4FM/FM4VIS: supported by the National Natural Science Foundation of China (Grant Nos. U21A20469 and 61936002), the National Key R&D Program of China (Grant No. 2020YFB2104100), and grants from the Institute Guo Qiang, THUIBCS, and BLBCI.
TCVC: supported by grants from the National Natural Science Foundation of China (61906184), the Joint Lab of CAS–HK, and the Shanghai Committee of Science and Technology, China (20DZ1100800, 21DZ1100100).
Template matching: supported in part by the National Key R&D Program of China (2018AAA0102200) and the National Natural Science Foundation of China (62002375, 62002376, 62325221, 62132021).
PPF-based 6D pose estimation: supported in part by the National Key R&D Program of China (2018AAA0102200), the National Natural Science Foundation of China (62132021, 62102435, 61902419, 62002375, 62002376), the Natural Science Foundation of Hunan Province of China (2021JJ40696), the Huxiang Youth Talent Support Program (2021RC3071), and NUDT Research Grants (ZK19-30, ZK22-52).
SAM-driven vehicle Re-ID: supported by the National Natural Science Foundation of China under Grant Nos. 62076117 and 62166026, and Jiangxi Grant Nos. 20224BAB212011, 20232BAB212008, and 20232BAB202051.
Cross-fusion dual-attention FER network: supported in part by the National Natural Science Foundation of China under Grant Nos. 62272281 and 62007017, the Special Funds for Taishan Scholars Project under Grant No. tsqn202306274, and the Youth Innovation Technology Project of the Higher School in Shandong Province under Grant No. 2019KJN042.
MGSG: supported by the National Key R&D Program of China (2019YFC1521102), the National Natural Science Foundation of China (61932003), and the Beijing Science and Technology Plan (Z221100007722004).