The Computational Visual Media(CVM)conference series provides a leading international forum for the exchange of innovative research ideas and significant computational methodologies that both underpin and advance visu...The Computational Visual Media(CVM)conference series provides a leading international forum for the exchange of innovative research ideas and significant computational methodologies that both underpin and advance visual media.Its primary mission is to foster cross-disciplinary research that integrates computer graphics,computer vision,machine learning,image and video processing,visualization,and geometric computing.Topics of particular interest include classification,composition,retrieval,synthesis,cognition,and understanding of visual media,encompassing images,video,and 3D geometry.展开更多
Data diversity and abundance are essential for improving the performance and generalization of models in natural language processing and 2D vision.However,the 3D vision domain suffers from a lack of 3D data,and simply...Data diversity and abundance are essential for improving the performance and generalization of models in natural language processing and 2D vision.However,the 3D vision domain suffers from a lack of 3D data,and simply combining multiple 3D datasets for pretraining a 3D backbone does not yield significant improvement,due to the domain discrepancies among different 3D datasets that impede effective feature learning.In this work,we identify the main sources of the domain discrepancies between 3D indoor scene datasets,and propose Swin3d++,an enhanced architecture based on Swin3d for efficient pretraining on multi-source 3D point clouds.Swin3d++introduces domain-specific mechanisms to SWIN3D's modules to address domain discrepancies and enhance the network capability on multi-source pretraining.Moreover,we devise a simple source-augmentation strategy to increase the pretraining data scale and facilitate supervised pretraining.We validate the effectiveness of our design,and demonstrate that Swin3d++surpasses the state-of-the-art 3D pretraining methods on typical indoor scene understanding tasks.展开更多
Estimating 3D human pose from 2D images in real world contexts remains a challenge,characterized by unique data constraints.Large general datasets of motion-captured 3D adult human poses paired with 2D images exist,bu...Estimating 3D human pose from 2D images in real world contexts remains a challenge,characterized by unique data constraints.Large general datasets of motion-captured 3D adult human poses paired with 2D images exist,but in many application settings,collection of further motion-captured data is impossible,precluding a straightforward fine-tuning approach to adaptation.We present a method for improving 3D pose estimation transfer learning to domains where there are only depth camera images available as supervision.Our heuristic weakly supervised 3D human pose(HW-HuP)estimation method learns partial pose priors from general 3D human pose datasets and employs weak supervision with depth data to guide learning in an optimization and regression cycle.We show that HW-HuP meaningfully improves upon state-of-the-art models in the adult in-bed setting,as well as on large scale public 3D human pose datasets,under comparable supervision conditions.Our model code and data are publicly available at https://github.com/ostadabbas/hw-hup.A significantly expanded version of this paper,with supplementary material,is available as a preprint on arXiv at https://arxiv.org/abs/2105.10996.展开更多
Semantic image synthesis aims to generate highquality images given semantic conditions,i.e.,segmentation masks and style reference images.Existing methods widely adopt generative adversarial networks(GANs).GANs take a...Semantic image synthesis aims to generate highquality images given semantic conditions,i.e.,segmentation masks and style reference images.Existing methods widely adopt generative adversarial networks(GANs).GANs take all conditional inputs and directly synthesize images in a single forward step.In this paper,semantic image synthesis is treated as an image denoising task and is handled with a novel image-to-image diffusion model(IIDM).展开更多
Diffusion-based models have recently achieved remarkable success in style transfer. However, when training data is scarce, existing methods struggle to effectively balance style and content. In this paper, we propose ...Diffusion-based models have recently achieved remarkable success in style transfer. However, when training data is scarce, existing methods struggle to effectively balance style and content. In this paper, we propose Style-Aware Diffusion (SAD), a novel method that harnesses efficient low-rank adaptation training techniques. Specifically, We extract latent representations of both style and content using DDIM inversion, formulated as an ordinary differential equation. Then, we use adaptive instance normalization and query–key–value injection to effectively align low-level style features with high-level content semantics. In addition, we propose parameter-efficient adaptation, which mitigates catastrophic forgetting and overfitting by rationally optimizing the weights of the attention layers, ensuring robust and effective performance, and achieving a 61.5% relative score increase over the plain model. The proposed method outperforms the high-performance DreamBooth-LoRA model and won the Fourth Jittor Artificial Intelligence Challenge. Our model is implemented using the Jittor framework and is available at https://github.com/liylo/jittor-qwqw-Few_Shot_Style_Transfer.展开更多
Learning-based multiple view stereo has gained significant attention recently.However,most methods rely on direct network supervision using provided ground-truth depth,which poses three inherent problems:resolution-de...Learning-based multiple view stereo has gained significant attention recently.However,most methods rely on direct network supervision using provided ground-truth depth,which poses three inherent problems:resolution-dependent ground-truth artifacts,excessively challenging training examples(with relatively featureless textures),and use of less-viewed reference pixels for supervision,all of which hinder network optimization.To alleviate these problems,we propose an accurate network supervision paradigm that includes a ground-truth mask,an entropy mask,and a consistency mask,which provide more accurate supervision signals to aid network optimization.展开更多
Visible and infrared image fusion(VIF)aims to combine information from visible and infrared images into a single fused image.Previous VIF methods usually employ a color space transformation to keep the hue and saturat...Visible and infrared image fusion(VIF)aims to combine information from visible and infrared images into a single fused image.Previous VIF methods usually employ a color space transformation to keep the hue and saturation from the original visible image.However,for fast VIF methods,this operation accounts for the majority of the calculation and is the bottleneck preventing faster processing.In this paper,we propose a fast fusion method,FCDFusion,with little color deviation.It preserves color information without color space transformations,by directly operating in RGB color space.It incorporates gamma correction at little extra cost,allowing color and contrast to be rapidly improved.We regard the fusion process as a scaling operation on 3D color vectors,greatly simplifying the calculations.A theoretical analysis and experiments show that our method can achieve satisfactory results in only 7 FLOPs per pixel.Compared to state-of-theart fast,color-preserving methods using HSV color space,our method provides higher contrast at only half of the computational cost.We further propose a new metric,color deviation,to measure the ability of a VIF method to preserve color.It is specifically designed for VIF tasks with color visible-light images,and overcomes deficiencies of existing VIF metrics used for this purpose.Our code is available at https://github.com/HeasonLee/FCDFusion.展开更多
Large-scale unsupervised semantic segmentation(LUSS)is a sophisticated process that aims to segment similar areas within an image without relying on labeled training data.While existing methodologies have made substan...Large-scale unsupervised semantic segmentation(LUSS)is a sophisticated process that aims to segment similar areas within an image without relying on labeled training data.While existing methodologies have made substantial progress in this area,there is ample scope for enhancement.We thus introduce the PASS-SAM model,a comprehensive solution that amalgamates the benefits of various models to improve segmentation performance.展开更多
While recent Gaussian-based SLAM methods achieve photorealistic reconstruction from RGB-D data, their computational performance remains a critical bottleneck. State-of-the-art techniques operate at less than 20 fps, s...While recent Gaussian-based SLAM methods achieve photorealistic reconstruction from RGB-D data, their computational performance remains a critical bottleneck. State-of-the-art techniques operate at less than 20 fps, significantly lagging behind geometry-based approaches like KinectFusion (hundreds of fps). This limitation stems from the heavy computational burden: modeling scenes requires numerous Gaussians and complex iterative optimization to fit RGB-D data;insufficient Gaussian counts or optimization iterations cause severe quality degradation. To address this, we propose a Gaussian-SDF hybrid representation, combining a colorized signed distance field (SDF) for smooth geometry and appearance with 3D Gaussians to capture underrepresented details. The SDF is efficiently constructed via RGB-D fusion (as in geometry-based methods), while Gaussians undergo iterative optimization. Our representation enables significant Gaussian reduction (50% fewer) by avoiding full-scene Gaussian modeling, and efficient Gaussian optimization (75% fewer iterations) through targeted appearance refinement. Building upon this representation, we develop GPS-SLAM (Gaussian-plus-SDF SLAM), a real-time 3D reconstruction system achieving over 150 fps on real-world Azure Kinect sequences, faster by an order-of-magnitude than state-of-the-art techniques while maintaining comparable reconstruction quality. The source code and data are available at https://gapszju.github.io/GPS-SLAM.展开更多
We introduce continuous indexed points for improved multivariate volume visualization.Indexed points represent linear structures in parallel coordinates and can be used to encode local correlation of multivariate(incl...We introduce continuous indexed points for improved multivariate volume visualization.Indexed points represent linear structures in parallel coordinates and can be used to encode local correlation of multivariate(including multi-field,multifaceted,and multi-attribute)volume data.First,we perform local linear fitting in the spatial neighborhood of each volume sample using principal component analysis,accelerated by hierarchical spatial data structures.This local linear information is then visualized as continuous indexed points in parallel coordinates:a density representation of indexed points in a continuous domain.With our new method,multivariate volume data can be analyzed using eigenvector information from local spatial embeddings.We utilize both 1-flat and 2-flat indexed points,allowing us to identify correlations between two variables and even three variables,respectively.An interactive occlusion shading model facilitates good spatial perception of the volume rendering of volumetric correlation characteristics.Interactive exploration is supported by specifically designed multivariate transfer function widgets working in the image plane of parallel coordinates.We show that our generic technique works for multi-attribute datasets.The effectiveness and usefulness of our new method is demonstrated through a case study,an expert user study,and domain expert feedback.展开更多
The use of pretrained backbones with finetuning has shown success for 2D vision and natural language processing tasks,with advantages over taskspecific networks.In this paper,we introduce a pretrained 3D backbone,call...The use of pretrained backbones with finetuning has shown success for 2D vision and natural language processing tasks,with advantages over taskspecific networks.In this paper,we introduce a pretrained 3D backbone,called Swin3D,for 3D indoor scene understanding.We designed a 3D Swin Transformer as our backbone network,which enables efficient selfattention on sparse voxels with linear memory complexity,making the backbone scalable to large models and datasets.We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance.We pretrained a large Swin3D model on a synthetic Structured3D dataset,which is an order of magnitude larger than the ScanNet dataset.Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks with+2.3 mIoU and+2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation,respectively,+1.8 mIoU on ScanNet segmentation(val),+1.9 mAP@0.5 on ScanNet detection,and+8.1 mAP@0.5 on S3DIS detection.A series of extensive ablation studies further validated the scalability,generality,and superior performance enabled by our approach.展开更多
A learning-based adaptive loop filter is developed for the geometry-based point-cloud compression(G-PCC)standard to reduce attribute compression artifacts.The proposed method first generates multiple most probable sam...A learning-based adaptive loop filter is developed for the geometry-based point-cloud compression(G-PCC)standard to reduce attribute compression artifacts.The proposed method first generates multiple most probable sample offsets(MPSOs)as potential compression distortion approximations,and then linearly weights them for artifact mitigation.Therefore,we drive the filtered reconstruction as closely to the uncompressed PCA as possible.To this end,we devise an attribute artifact reduction network(ARNet)consisting of two consecutive processing phases:MPSOs derivation and MPSOs combination.The MPSOs derivation uses a two-stream network to model local neighborhood variations from direct spatial embedding and frequency-dependent embedding,where sparse convolutions are utilized to best aggregate information from sparsely and irregularly distributed points.The MPSOs combination is guided by the least-squares error metric to derive weighting coefficients on the fly to further capture the content dynamics of the input PCAs.ARNet is implemented as an in-loop filtering tool for GPCC,where the linear weighting coefficients are encapsulated into the bitstream with negligible bitrate overhead.The experimental results demonstrate significant improvements over the latest G-PCC both subjectively and objectively.For example,our method offers a 22.12%YUV Bjøntegaard delta rate(BDRate)reduction compared to G-PCC across various commonly used test point clouds.Compared with a recent study showing state-of-the-art performance,our work not only gains 13.23%YUV BD-Rate but also provides a 30×processing speedup.展开更多
Inspired by Minsky’s Society of Mind,Schmidhuber’s Learning to Think,and other more 9-16 recent works,this paper proposes and advocates for the concept of natural language-based societies of mind(NLSOMs).We imagine ...Inspired by Minsky’s Society of Mind,Schmidhuber’s Learning to Think,and other more 9-16 recent works,this paper proposes and advocates for the concept of natural language-based societies of mind(NLSOMs).We imagine these societies as consisting of a collection of multimodal neural networks,including large language models,which engage in a“mindstorm”to solve problems using a shared natural language interface.Here,we work to identify and discuss key questions about the social structure,governance,and economic principles for NLSOMs,emphasizing their impact on the future of AI.Our demonstrations with NLSOMs—which feature up to 129 agents—show their effectiveness in various tasks,including visual question answering,image captioning,and prompt generation for text-to-image synthesis.展开更多
This paper presents a multi-task gradual inference model,MTGINet,for automatic portrait matting.It handles the subtasks of automatic portrait matting,namely portrait–transition–background trimap segmentation and tra...This paper presents a multi-task gradual inference model,MTGINet,for automatic portrait matting.It handles the subtasks of automatic portrait matting,namely portrait–transition–background trimap segmentation and transition region matting,with a single encoder–decoder structure.First,we enrich the highest stage of features from the encoder with portrait shape context via a shape context aggregation(SCA)module for trimap segmentation.Then,we fuse the SCA-enhanced features with detailed clues from the encoder for transition-region-aware alpha matting.The gradual inference model naturally allows sufficient interaction between the subtasks via forward computation and backwards propagation during training,and therefore achieves high accuracy while maintaining low complexity.In addition,considering the discrepancies in feature requirements across subtasks,we adapt the features from the encoders before reusing them via a feature rectification module.In addition to the MTGINet model,we have constructed a new large-scale dataset,HPM-17K,for half-body portrait matting.It consists of 16,967 images with diverse backgrounds.Comparative experiments with existing deep models on the public P3M-10K dataset and our HPM-17K dataset demonstrate that the proposed model exhibits state-of-the-art performance.展开更多
Language-guided fashion image editing is challenging,as fashion image editing is local and requires high precision,while natural language cannot provide precise visual information for guidance.In this paper,we propose...Language-guided fashion image editing is challenging,as fashion image editing is local and requires high precision,while natural language cannot provide precise visual information for guidance.In this paper,we propose LucIE,a novel unsupervised language-guided local image editing method for fashion images.LucIE adopts and modifies recent text-to-image synthesis network,DF-GAN,as its backbone.However,the synthesis backbone often changes the global structure of the input image,making local image editing impractical.To increase structural consistency between input and edited images,we propose Content-Preserving Fusion Module(CPFM).Different from existing fusion modules,CPFM prevents iterative refinement on visual feature maps and accumulates additive modifications on RGB maps.LucIE achieves local image editing explicitly with language-guided image segmentation and maskguided image blending while only using image and text pairs.Results on the DeepFashion dataset shows that LucIE achieves state-of-the-art results.Compared with previous methods,images generated by LucIE also exhibit fewer artifacts.We provide visualizations and perform ablation studies to validate LucIE and the CPFM.We also demonstrate and analyze limitations of LucIE,to provide a better understanding of LucIE.展开更多
Low-rank tensor completion(LRTC)restores missing elements in multidimensional visual data;the challenge is representing the inherent structures within this data.Typical methods either suffer from inefficiency owing to...Low-rank tensor completion(LRTC)restores missing elements in multidimensional visual data;the challenge is representing the inherent structures within this data.Typical methods either suffer from inefficiency owing to the combination of multiple regularizers or perform suboptimally using inappropriate priors.In this study,we further investigated LRTC using tensor singular value decomposition(t-SVD).Inspired by the tensor-tensor product(t-product),we proposed a unified transformed t-SVD method that employs an invertible linear transform with a unitary transform matrix.However,the t-SVD-based framework lacks the flexibility necessary to represent different inherent relations along the tensor modes.To address this issue,we propose a tensor represented by a series of multidimensional unfolding tensors to fully explore the hidden structure of the original data.Furthermore,the proposed model can be solved efficiently using the alternate-direction method of the multiplier(ADMM)approach.Extensive experimental results on multidimensional visual data(multispectral images,hyperspectral images,and videos)demonstrated the superiority of the proposed method over other state-of-the-art LRTC-related methods.展开更多
We propose a normalizing flow based on the wavelet framework for super-resolution(SR)called WDFSR.It learns the conditional distribution mapping between low-resolution images in the RGB domain and high-resolution imag...We propose a normalizing flow based on the wavelet framework for super-resolution(SR)called WDFSR.It learns the conditional distribution mapping between low-resolution images in the RGB domain and high-resolution images in the wavelet domain to simultaneously generate high-resolution images of different styles.To address the issue of some flowbased models being sensitive to datasets,which results in training fluctuations that reduce the mapping ability of the model and weaken generalization,we designed a method that combines a T-distribution and QR decomposition layer.Our method alleviates this problem while maintaining the ability of the model to map different distributions and produce higher-quality images.Good contextual conditional features can promote model training and enhance the distribution mapping capabilities for conditional distribution mapping.Therefore,we propose a Refinement layer combined with an attention mechanism to refine and fuse the extracted condition features to improve image quality.Extensive experiments on several SR datasets demonstrate that WDFSR outperforms most general CNN-and flow-based models in terms of PSNR value and perception quality.We also demonstrated that our framework works well for other low-level vision tasks,such as low-light enhancement.The pretrained models and source code with guidance for reference are available at https://github.com/Lisbegin/WDFSR.展开更多
Large models have accelerated the development of intelligent interpretation in remote sensing.Many remote sensing foundation models(RSFM)have emerged in recent years,sparking a new wave of deep learning in this field....Large models have accelerated the development of intelligent interpretation in remote sensing.Many remote sensing foundation models(RSFM)have emerged in recent years,sparking a new wave of deep learning in this field.Fine-tuning techniques serve as a bridge between remote sensing downstream tasks and advanced foundation models.As RSFMs become more powerful,fine-tuning techniques are expected to lead the next research frontier in numerous critical remote sensing applications.Advanced fine-tuning techniques can reduce the data and computational resource requirements during the downstream adaptation process.Current fine-tuning techniques for remote sensing are still in their early stages,leaving a large space for optimization and application.To elucidate the current development and future trends of remote sensing fine-tuning techniques,this survey offers a comprehensive overview of recent research.Specifically,this survey summarizes the applications and innovations of each work and categorizes recent remote sensing fine-tuning techniques into six types:adapter-based,prompt-based,reparameterization-based,hybrid methods,partial tuning,and improved tuning.展开更多
Generating emotional talking faces from a single portrait image remains a significant challenge. The simultaneous achievement of expressive emotional talking and accurate lip-sync is particularly difficult, as express...Generating emotional talking faces from a single portrait image remains a significant challenge. The simultaneous achievement of expressive emotional talking and accurate lip-sync is particularly difficult, as expressiveness is often compromised for lip-sync accuracy. Prevailing generative works usually struggle to juggle to generate subtle variations of emotional expression and lip-synchronized talking. To address these challenges, we suggest modeling the implicit and explicit correlations between audio and emotional talking faces with a unified framework. As human emotional expressions usually present subtle and implicit relations with speech audio, we propose incorporating audio and emotional style embeddings into the diffusion-based generation process, for realistic generation while concentrating on emotional expressions. We then propose lip-based explicit correlation learning to construct a strong mapping of audio to lip motions, assuring lip-audio synchronization. Furthermore, we deploy a video-to-video rendering module to transfer expressions and lip motions from a proxy 3D avatar to an arbitrary portrait. Both quantitatively and qualitatively, MagicTalk outperforms state-of-the-art methods in terms of expressiveness, lip-sync, and perceptual quality.展开更多
The task of detecting three-dimensional objects using only RGB images presents a considerable challenge within the domain of computer vision.The core issue lies in accurately performing epipolar geometry matching betw...The task of detecting three-dimensional objects using only RGB images presents a considerable challenge within the domain of computer vision.The core issue lies in accurately performing epipolar geometry matching between multiple views to obtain latent geometric priors.Existing methods establish correspondences along epipolar line features in voxel space through various layers of convolution.However,this step often occurs in the later stages of the network,which limits overall performance.To address this challenge,we introduce a novel framework,ImVoxelENet,that integrates a geometric epipolar constraint.We start from the back-projection of pixel-wise features and design an attention mechanism that captures the relationship between forward and backward features along the ray for multiple views.This approach enables the early establishment of geometric correspondences and structural connections between epipolar lines.Using ScanNetV2 as a benchmark,extensive comparative and ablation experiments demonstrate that our proposed network achieves a 1.1%improvement in mAP,highlighting its effectiveness in enhancing 3D object detection performance.Our code is available at https://github.com/xug-coder/ImVoxelENet.展开更多
文摘The Computational Visual Media(CVM)conference series provides a leading international forum for the exchange of innovative research ideas and significant computational methodologies that both underpin and advance visual media.Its primary mission is to foster cross-disciplinary research that integrates computer graphics,computer vision,machine learning,image and video processing,visualization,and geometric computing.Topics of particular interest include classification,composition,retrieval,synthesis,cognition,and understanding of visual media,encompassing images,video,and 3D geometry.
文摘Data diversity and abundance are essential for improving the performance and generalization of models in natural language processing and 2D vision.However,the 3D vision domain suffers from a lack of 3D data,and simply combining multiple 3D datasets for pretraining a 3D backbone does not yield significant improvement,due to the domain discrepancies among different 3D datasets that impede effective feature learning.In this work,we identify the main sources of the domain discrepancies between 3D indoor scene datasets,and propose Swin3d++,an enhanced architecture based on Swin3d for efficient pretraining on multi-source 3D point clouds.Swin3d++introduces domain-specific mechanisms to SWIN3D's modules to address domain discrepancies and enhance the network capability on multi-source pretraining.Moreover,we devise a simple source-augmentation strategy to increase the pretraining data scale and facilitate supervised pretraining.We validate the effectiveness of our design,and demonstrate that Swin3d++surpasses the state-of-the-art 3D pretraining methods on typical indoor scene understanding tasks.
基金supported by the National Science Foundation under Grant No.1755695.
文摘Estimating 3D human pose from 2D images in real world contexts remains a challenge,characterized by unique data constraints.Large general datasets of motion-captured 3D adult human poses paired with 2D images exist,but in many application settings,collection of further motion-captured data is impossible,precluding a straightforward fine-tuning approach to adaptation.We present a method for improving 3D pose estimation transfer learning to domains where there are only depth camera images available as supervision.Our heuristic weakly supervised 3D human pose(HW-HuP)estimation method learns partial pose priors from general 3D human pose datasets and employs weak supervision with depth data to guide learning in an optimization and regression cycle.We show that HW-HuP meaningfully improves upon state-of-the-art models in the adult in-bed setting,as well as on large scale public 3D human pose datasets,under comparable supervision conditions.Our model code and data are publicly available at https://github.com/ostadabbas/hw-hup.A significantly expanded version of this paper,with supplementary material,is available as a preprint on arXiv at https://arxiv.org/abs/2105.10996.
基金supported by the National Natural Science Foundation for Young Scientists of China Award(No.62106289).
文摘Semantic image synthesis aims to generate highquality images given semantic conditions,i.e.,segmentation masks and style reference images.Existing methods widely adopt generative adversarial networks(GANs).GANs take all conditional inputs and directly synthesize images in a single forward step.In this paper,semantic image synthesis is treated as an image denoising task and is handled with a novel image-to-image diffusion model(IIDM).
基金supported by the Postdoctoral Fellowship Program of CPSF(GZC20240829).
文摘Diffusion-based models have recently achieved remarkable success in style transfer. However, when training data is scarce, existing methods struggle to effectively balance style and content. In this paper, we propose Style-Aware Diffusion (SAD), a novel method that harnesses efficient low-rank adaptation training techniques. Specifically, We extract latent representations of both style and content using DDIM inversion, formulated as an ordinary differential equation. Then, we use adaptive instance normalization and query–key–value injection to effectively align low-level style features with high-level content semantics. In addition, we propose parameter-efficient adaptation, which mitigates catastrophic forgetting and overfitting by rationally optimizing the weights of the attention layers, ensuring robust and effective performance, and achieving a 61.5% relative score increase over the plain model. The proposed method outperforms the high-performance DreamBooth-LoRA model and won the Fourth Jittor Artificial Intelligence Challenge. Our model is implemented using the Jittor framework and is available at https://github.com/liylo/jittor-qwqw-Few_Shot_Style_Transfer.
基金supported by the National Natural Science Foundation of China(Nos.U22B2055,62273345,and 62222302)the Beijing Natural Science Foundation(No.L223003)a Key R&D Project of Henan Province(No.231111210300).
文摘Learning-based multiple view stereo has gained significant attention recently.However,most methods rely on direct network supervision using provided ground-truth depth,which poses three inherent problems:resolution-dependent ground-truth artifacts,excessively challenging training examples(with relatively featureless textures),and use of less-viewed reference pixels for supervision,all of which hinder network optimization.To alleviate these problems,we propose an accurate network supervision paradigm that includes a ground-truth mask,an entropy mask,and a consistency mask,which provide more accurate supervision signals to aid network optimization.
基金supported by the National Natural Science Foundation of China under Grant Nos.62171038,61827901,and 62088101.
文摘Visible and infrared image fusion(VIF)aims to combine information from visible and infrared images into a single fused image.Previous VIF methods usually employ a color space transformation to keep the hue and saturation from the original visible image.However,for fast VIF methods,this operation accounts for the majority of the calculation and is the bottleneck preventing faster processing.In this paper,we propose a fast fusion method,FCDFusion,with little color deviation.It preserves color information without color space transformations,by directly operating in RGB color space.It incorporates gamma correction at little extra cost,allowing color and contrast to be rapidly improved.We regard the fusion process as a scaling operation on 3D color vectors,greatly simplifying the calculations.A theoretical analysis and experiments show that our method can achieve satisfactory results in only 7 FLOPs per pixel.Compared to state-of-theart fast,color-preserving methods using HSV color space,our method provides higher contrast at only half of the computational cost.We further propose a new metric,color deviation,to measure the ability of a VIF method to preserve color.It is specifically designed for VIF tasks with color visible-light images,and overcomes deficiencies of existing VIF metrics used for this purpose.Our code is available at https://github.com/HeasonLee/FCDFusion.
文摘Large-scale unsupervised semantic segmentation(LUSS)is a sophisticated process that aims to segment similar areas within an image without relying on labeled training data.While existing methodologies have made substantial progress in this area,there is ample scope for enhancement.We thus introduce the PASS-SAM model,a comprehensive solution that amalgamates the benefits of various models to improve segmentation performance.
基金supported by the National Natural Science Foundation of China(U23A20311,62421003).
文摘While recent Gaussian-based SLAM methods achieve photorealistic reconstruction from RGB-D data, their computational performance remains a critical bottleneck. State-of-the-art techniques operate at less than 20 fps, significantly lagging behind geometry-based approaches like KinectFusion (hundreds of fps). This limitation stems from the heavy computational burden: modeling scenes requires numerous Gaussians and complex iterative optimization to fit RGB-D data;insufficient Gaussian counts or optimization iterations cause severe quality degradation. To address this, we propose a Gaussian-SDF hybrid representation, combining a colorized signed distance field (SDF) for smooth geometry and appearance with 3D Gaussians to capture underrepresented details. The SDF is efficiently constructed via RGB-D fusion (as in geometry-based methods), while Gaussians undergo iterative optimization. Our representation enables significant Gaussian reduction (50% fewer) by avoiding full-scene Gaussian modeling, and efficient Gaussian optimization (75% fewer iterations) through targeted appearance refinement. Building upon this representation, we develop GPS-SLAM (Gaussian-plus-SDF SLAM), a real-time 3D reconstruction system achieving over 150 fps on real-world Azure Kinect sequences, faster by an order-of-magnitude than state-of-the-art techniques while maintaining comparable reconstruction quality. The source code and data are available at https://gapszju.github.io/GPS-SLAM.
基金supported in part by the National Natural Science Foundation of China(62372012)the Deutsche Forschungsgemeinschaft(DFG,German Research Foundation)-Project-ID 251654672-TRR 161.
文摘We introduce continuous indexed points for improved multivariate volume visualization.Indexed points represent linear structures in parallel coordinates and can be used to encode local correlation of multivariate(including multi-field,multifaceted,and multi-attribute)volume data.First,we perform local linear fitting in the spatial neighborhood of each volume sample using principal component analysis,accelerated by hierarchical spatial data structures.This local linear information is then visualized as continuous indexed points in parallel coordinates:a density representation of indexed points in a continuous domain.With our new method,multivariate volume data can be analyzed using eigenvector information from local spatial embeddings.We utilize both 1-flat and 2-flat indexed points,allowing us to identify correlations between two variables and even three variables,respectively.An interactive occlusion shading model facilitates good spatial perception of the volume rendering of volumetric correlation characteristics.Interactive exploration is supported by specifically designed multivariate transfer function widgets working in the image plane of parallel coordinates.We show that our generic technique works for multi-attribute datasets.The effectiveness and usefulness of our new method is demonstrated through a case study,an expert user study,and domain expert feedback.
文摘The use of pretrained backbones with finetuning has shown success for 2D vision and natural language processing tasks,with advantages over taskspecific networks.In this paper,we introduce a pretrained 3D backbone,called Swin3D,for 3D indoor scene understanding.We designed a 3D Swin Transformer as our backbone network,which enables efficient selfattention on sparse voxels with linear memory complexity,making the backbone scalable to large models and datasets.We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance.We pretrained a large Swin3D model on a synthetic Structured3D dataset,which is an order of magnitude larger than the ScanNet dataset.Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks with+2.3 mIoU and+2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation,respectively,+1.8 mIoU on ScanNet segmentation(val),+1.9 mAP@0.5 on ScanNet detection,and+8.1 mAP@0.5 on S3DIS detection.A series of extensive ablation studies further validated the scalability,generality,and superior performance enabled by our approach.
基金supported by the National Natural Science Foundation of China under Grant No.62171174National Key R&D Program of China under Grant No.2023YFC3706605the Open Project of Zhejiang Provincial Key Laboratory of Information Processing,Communication,and Networking.
文摘A learning-based adaptive loop filter is developed for the geometry-based point-cloud compression(G-PCC)standard to reduce attribute compression artifacts.The proposed method first generates multiple most probable sample offsets(MPSOs)as potential compression distortion approximations,and then linearly weights them for artifact mitigation.Therefore,we drive the filtered reconstruction as closely to the uncompressed PCA as possible.To this end,we devise an attribute artifact reduction network(ARNet)consisting of two consecutive processing phases:MPSOs derivation and MPSOs combination.The MPSOs derivation uses a two-stream network to model local neighborhood variations from direct spatial embedding and frequency-dependent embedding,where sparse convolutions are utilized to best aggregate information from sparsely and irregularly distributed points.The MPSOs combination is guided by the least-squares error metric to derive weighting coefficients on the fly to further capture the content dynamics of the input PCAs.ARNet is implemented as an in-loop filtering tool for GPCC,where the linear weighting coefficients are encapsulated into the bitstream with negligible bitrate overhead.The experimental results demonstrate significant improvements over the latest G-PCC both subjectively and objectively.For example,our method offers a 22.12%YUV Bjøntegaard delta rate(BDRate)reduction compared to G-PCC across various commonly used test point clouds.Compared with a recent study showing state-of-the-art performance,our work not only gains 13.23%YUV BD-Rate but also provides a 30×processing speedup.
基金supported by the European Research Council(ERC,Advanced Grant Number 742870the Swiss National Science Foundation(SNF,Grant Numbers 200021 and 192356)the National Natural Science Foundation of China(Grant Number 62476143).
文摘Inspired by Minsky’s Society of Mind,Schmidhuber’s Learning to Think,and other more 9-16 recent works,this paper proposes and advocates for the concept of natural language-based societies of mind(NLSOMs).We imagine these societies as consisting of a collection of multimodal neural networks,including large language models,which engage in a“mindstorm”to solve problems using a shared natural language interface.Here,we work to identify and discuss key questions about the social structure,governance,and economic principles for NLSOMs,emphasizing their impact on the future of AI.Our demonstrations with NLSOMs—which feature up to 129 agents—show their effectiveness in various tasks,including visual question answering,image captioning,and prompt generation for text-to-image synthesis.
基金supported by the National Natural Science Foundation of China(Nos.62176010 and 61771026).
文摘This paper presents a multi-task gradual inference model,MTGINet,for automatic portrait matting.It handles the subtasks of automatic portrait matting,namely portrait–transition–background trimap segmentation and transition region matting,with a single encoder–decoder structure.First,we enrich the highest stage of features from the encoder with portrait shape context via a shape context aggregation(SCA)module for trimap segmentation.Then,we fuse the SCA-enhanced features with detailed clues from the encoder for transition-region-aware alpha matting.The gradual inference model naturally allows sufficient interaction between the subtasks via forward computation and backwards propagation during training,and therefore achieves high accuracy while maintaining low complexity.In addition,considering the discrepancies in feature requirements across subtasks,we adapt the features from the encoders before reusing them via a feature rectification module.In addition to the MTGINet model,we have constructed a new large-scale dataset,HPM-17K,for half-body portrait matting.It consists of 16,967 images with diverse backgrounds.Comparative experiments with existing deep models on the public P3M-10K dataset and our HPM-17K dataset demonstrate that the proposed model exhibits state-of-the-art performance.
文摘Language-guided fashion image editing is challenging,as fashion image editing is local and requires high precision,while natural language cannot provide precise visual information for guidance.In this paper,we propose LucIE,a novel unsupervised language-guided local image editing method for fashion images.LucIE adopts and modifies recent text-to-image synthesis network,DF-GAN,as its backbone.However,the synthesis backbone often changes the global structure of the input image,making local image editing impractical.To increase structural consistency between input and edited images,we propose Content-Preserving Fusion Module(CPFM).Different from existing fusion modules,CPFM prevents iterative refinement on visual feature maps and accumulates additive modifications on RGB maps.LucIE achieves local image editing explicitly with language-guided image segmentation and maskguided image blending while only using image and text pairs.Results on the DeepFashion dataset shows that LucIE achieves state-of-the-art results.Compared with previous methods,images generated by LucIE also exhibit fewer artifacts.We provide visualizations and perform ablation studies to validate LucIE and the CPFM.We also demonstrate and analyze limitations of LucIE,to provide a better understanding of LucIE.
基金supported by the Zhejiang Lab Key Research Project(No.G2021NB0AL03)Zhejiang Lab Youth Foundation Project(No.K2023NB0AA03).
文摘Low-rank tensor completion(LRTC)restores missing elements in multidimensional visual data;the challenge is representing the inherent structures within this data.Typical methods either suffer from inefficiency owing to the combination of multiple regularizers or perform suboptimally using inappropriate priors.In this study,we further investigated LRTC using tensor singular value decomposition(t-SVD).Inspired by the tensor-tensor product(t-product),we proposed a unified transformed t-SVD method that employs an invertible linear transform with a unitary transform matrix.However,the t-SVD-based framework lacks the flexibility necessary to represent different inherent relations along the tensor modes.To address this issue,we propose a tensor represented by a series of multidimensional unfolding tensors to fully explore the hidden structure of the original data.Furthermore,the proposed model can be solved efficiently using the alternate-direction method of the multiplier(ADMM)approach.Extensive experimental results on multidimensional visual data(multispectral images,hyperspectral images,and videos)demonstrated the superiority of the proposed method over other state-of-the-art LRTC-related methods.
基金grateful to Zhejiang Gongshang University for its valuable computing resources and outstanding laboratory facilities,and support from the National Natural Science Foundation of China(Grant No.62172366)the Zhejiang Provincial Natural Science Foundation of China(Grant No.LY22F020013)+1 种基金“Pioneer”and“Leading Goose”R&D Program of Zhejiang Province(Grant No.2023C01150),Major Sci-Tech Innovation Project of Hangzhou City(Grant No.2022AIZD0110)“Digital+”Discipline Construction Project of Zhejiang Gongshang University(Grant No.SZJ2022B009).
文摘We propose a normalizing flow based on the wavelet framework for super-resolution(SR)called WDFSR.It learns the conditional distribution mapping between low-resolution images in the RGB domain and high-resolution images in the wavelet domain to simultaneously generate high-resolution images of different styles.To address the issue of some flowbased models being sensitive to datasets,which results in training fluctuations that reduce the mapping ability of the model and weaken generalization,we designed a method that combines a T-distribution and QR decomposition layer.Our method alleviates this problem while maintaining the ability of the model to map different distributions and produce higher-quality images.Good contextual conditional features can promote model training and enhance the distribution mapping capabilities for conditional distribution mapping.Therefore,we propose a Refinement layer combined with an attention mechanism to refine and fuse the extracted condition features to improve image quality.Extensive experiments on several SR datasets demonstrate that WDFSR outperforms most general CNN-and flow-based models in terms of PSNR value and perception quality.We also demonstrated that our framework works well for other low-level vision tasks,such as low-light enhancement.The pretrained models and source code with guidance for reference are available at https://github.com/Lisbegin/WDFSR.
基金supported by the National Natural Science Foundation of China(62495061,62495064,and 62476143)the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology,and the Shuimu Tsinghua Scholar Program.
文摘Large models have accelerated the development of intelligent interpretation in remote sensing.Many remote sensing foundation models(RSFM)have emerged in recent years,sparking a new wave of deep learning in this field.Fine-tuning techniques serve as a bridge between remote sensing downstream tasks and advanced foundation models.As RSFMs become more powerful,fine-tuning techniques are expected to lead the next research frontier in numerous critical remote sensing applications.Advanced fine-tuning techniques can reduce the data and computational resource requirements during the downstream adaptation process.Current fine-tuning techniques for remote sensing are still in their early stages,leaving a large space for optimization and application.To elucidate the current development and future trends of remote sensing fine-tuning techniques,this survey offers a comprehensive overview of recent research.Specifically,this survey summarizes the applications and innovations of each work and categorizes recent remote sensing fine-tuning techniques into six types:adapter-based,prompt-based,reparameterization-based,hybrid methods,partial tuning,and improved tuning.
文摘Generating emotional talking faces from a single portrait image remains a significant challenge. The simultaneous achievement of expressive emotional talking and accurate lip-sync is particularly difficult, as expressiveness is often compromised for lip-sync accuracy. Prevailing generative works usually struggle to juggle to generate subtle variations of emotional expression and lip-synchronized talking. To address these challenges, we suggest modeling the implicit and explicit correlations between audio and emotional talking faces with a unified framework. As human emotional expressions usually present subtle and implicit relations with speech audio, we propose incorporating audio and emotional style embeddings into the diffusion-based generation process, for realistic generation while concentrating on emotional expressions. We then propose lip-based explicit correlation learning to construct a strong mapping of audio to lip motions, assuring lip-audio synchronization. Furthermore, we deploy a video-to-video rendering module to transfer expressions and lip motions from a proxy 3D avatar to an arbitrary portrait. Both quantitatively and qualitatively, MagicTalk outperforms state-of-the-art methods in terms of expressiveness, lip-sync, and perceptual quality.
文摘The task of detecting three-dimensional objects using only RGB images presents a considerable challenge within the domain of computer vision.The core issue lies in accurately performing epipolar geometry matching between multiple views to obtain latent geometric priors.Existing methods establish correspondences along epipolar line features in voxel space through various layers of convolution.However,this step often occurs in the later stages of the network,which limits overall performance.To address this challenge,we introduce a novel framework,ImVoxelENet,that integrates a geometric epipolar constraint.We start from the back-projection of pixel-wise features and design an attention mechanism that captures the relationship between forward and backward features along the ray for multiple views.This approach enables the early establishment of geometric correspondences and structural connections between epipolar lines.Using ScanNetV2 as a benchmark,extensive comparative and ablation experiments demonstrate that our proposed network achieves a 1.1%improvement in mAP,highlighting its effectiveness in enhancing 3D object detection performance.Our code is available at https://github.com/xug-coder/ImVoxelENet.