Journal Articles

458 articles found in total; page 1 of 23 is shown below (20 articles per page).
1. IIDM: Image-to-image diffusion model for semantic image synthesis
Authors: Feng Liu, Xiaobin Chang. Computational Visual Media, 2025, Issue 2, pp. 423-429 (7 pages).
Semantic image synthesis aims to generate high-quality images given semantic conditions, i.e., segmentation masks and style reference images. Existing methods widely adopt generative adversarial networks (GANs). GANs take all conditional inputs and directly synthesize images in a single forward step. In this paper, semantic image synthesis is treated as an image denoising task and is handled with a novel image-to-image diffusion model (IIDM).
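To make the denoising framing concrete, here is a minimal sketch of one DDPM-style reverse (denoising) step conditioned on a segmentation mask and a style feature. The conditioning interface of `eps_net` and the variance choice are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def reverse_step(x_t, t, seg_mask, style_feat, eps_net, alphas, alpha_bars, rng):
    """One DDPM-style reverse step: predict the noise in x_t given the
    semantic conditions, then form the posterior mean of x_{t-1}."""
    eps = eps_net(x_t, t, seg_mask, style_feat)    # hypothetical conditional denoiser
    a_t, ab_t = alphas[t], alpha_bars[t]
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(a_t)
    if t == 0:
        return mean                                # final step is deterministic
    sigma = np.sqrt(1.0 - a_t)                     # a common simple variance choice
    return mean + sigma * rng.standard_normal(x_t.shape)
```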
Keywords: semantic image synthesis; generative adversarial networks (GANs); image-to-image diffusion model; style reference images; image denoising
2. FCDFusion: A fast, low color deviation method for fusing visible and infrared image pairs
Authors: Hesong Li, Ying Fu. Computational Visual Media, 2025, Issue 1, pp. 195-211 (17 pages).
Visible and infrared image fusion (VIF) aims to combine information from visible and infrared images into a single fused image. Previous VIF methods usually employ a color space transformation to keep the hue and saturation from the original visible image. However, for fast VIF methods, this operation accounts for the majority of the calculation and is the bottleneck preventing faster processing. In this paper, we propose a fast fusion method, FCDFusion, with little color deviation. It preserves color information without color space transformations, by directly operating in RGB color space. It incorporates gamma correction at little extra cost, allowing color and contrast to be rapidly improved. We regard the fusion process as a scaling operation on 3D color vectors, greatly simplifying the calculations. A theoretical analysis and experiments show that our method can achieve satisfactory results in only 7 FLOPs per pixel. Compared to state-of-the-art fast, color-preserving methods using HSV color space, our method provides higher contrast at only half of the computational cost. We further propose a new metric, color deviation, to measure the ability of a VIF method to preserve color. It is specifically designed for VIF tasks with color visible-light images, and overcomes deficiencies of existing VIF metrics used for this purpose. Our code is available at https://github.com/HeasonLee/FCDFusion.
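The "scaling operation on 3D color vectors" admits a very compact implementation. The sketch below is a hedged illustration of that idea, not the paper's exact formula (see the linked repository for that): each pixel's RGB vector is rescaled so its intensity follows an assumed visible/infrared blend, and a gamma curve is applied at the end.

```python
import numpy as np

def scale_fuse(vis_rgb, ir, gamma=0.8, eps=1e-6):
    """Hedged sketch: fuse by scaling each RGB color vector so its
    intensity moves toward a visible/infrared blend, then gamma-correct.
    vis_rgb: (H, W, 3) in [0, 1]; ir: (H, W) in [0, 1]."""
    y = vis_rgb.mean(axis=-1, keepdims=True)       # per-pixel intensity
    target = 0.5 * (y + ir[..., None])             # assumed blend rule
    scale = target / (y + eps)                     # one scalar per pixel
    fused = np.clip(vis_rgb * scale, 0.0, 1.0)     # scale the 3D color vector
    return fused ** gamma                          # cheap gamma correction
```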
Keywords: infrared images; visible and infrared image fusion (VIF); gamma correction; real-time display; color metrics; color deviation
3. PASS-SAM: Integration of Segment Anything Model for Large-Scale Unsupervised Semantic Segmentation
Authors: Yin Tang, Rui Chen, Gensheng Pei, Qiong Wang. Computational Visual Media, 2025, Issue 3, pp. 669-674 (6 pages).
Large-scale unsupervised semantic segmentation (LUSS) is a sophisticated process that aims to segment similar areas within an image without relying on labeled training data. While existing methodologies have made substantial progress in this area, there is ample scope for enhancement. We thus introduce the PASS-SAM model, a comprehensive solution that amalgamates the benefits of various models to improve segmentation performance.
Keywords: large-scale unsupervised semantic segmentation (LUSS); Segment Anything Model (SAM); PASS-SAM; segmentation performance
4. Swin3D: A pretrained transformer backbone for 3D indoor scene understanding
Authors: Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, Baining Guo. Computational Visual Media, 2025, Issue 1, pp. 83-101 (19 pages).
The use of pretrained backbones with finetuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on the synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area 5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validated the scalability, generality, and superior performance enabled by our approach.
Keywords: 3D pretraining; point cloud analysis; transformer backbone; Swin Transformer; 3D semantic segmentation; 3D object detection
5. ARNet: Attribute artifact reduction for G-PCC compressed point clouds
Authors: Junzhe Zhang, Junteng Zhang, Dandan Ding, Zhan Ma. Computational Visual Media, 2025, Issue 2, pp. 327-342 (16 pages).
A learning-based adaptive loop filter is developed for the geometry-based point cloud compression (G-PCC) standard to reduce attribute compression artifacts. The proposed method first generates multiple most probable sample offsets (MPSOs) as potential compression distortion approximations, and then linearly weights them for artifact mitigation, driving the filtered reconstruction as close to the uncompressed point cloud attribute (PCA) as possible. To this end, we devise an attribute artifact reduction network (ARNet) consisting of two consecutive processing phases: MPSO derivation and MPSO combination. The MPSO derivation uses a two-stream network to model local neighborhood variations from direct spatial embedding and frequency-dependent embedding, where sparse convolutions are utilized to best aggregate information from sparsely and irregularly distributed points. The MPSO combination is guided by the least-squares error metric to derive weighting coefficients on the fly, to further capture the content dynamics of the input PCAs. ARNet is implemented as an in-loop filtering tool for G-PCC, where the linear weighting coefficients are encapsulated into the bitstream with negligible bitrate overhead. The experimental results demonstrate significant improvements over the latest G-PCC both subjectively and objectively. For example, our method offers a 22.12% YUV Bjøntegaard delta rate (BD-Rate) reduction compared to G-PCC across various commonly used test point clouds. Compared with a recent study showing state-of-the-art performance, our work not only gains 13.23% YUV BD-Rate but also provides a 30× processing speedup.
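The least-squares combination of MPSOs described above has a closed form. A minimal sketch follows, with illustrative shapes (attributes flattened to vectors); per the abstract, the resulting weights would be computed where the original is available and then signaled in the bitstream.

```python
import numpy as np

def combine_mpsos(mpsos, recon, original):
    """Hedged sketch: find weights w minimizing
    || original - (recon + sum_k w_k * mpso_k) ||^2.
    mpsos: (K, N) candidate offsets; recon, original: (N,) attributes."""
    A = mpsos.T                                    # (N, K) design matrix
    b = original - recon                           # residual to be explained
    w, *_ = np.linalg.lstsq(A, b, rcond=None)      # closed-form least squares
    return recon + A @ w, w                        # filtered attributes, weights
```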
Keywords: point cloud attribute compression; sparse convolution; sample offset; linear coefficient
6. Mindstorms in natural language-based societies of mind
Authors: Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya A. Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanić, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, Jürgen Schmidhuber. Computational Visual Media, 2025, Issue 1, pp. 29-81 (53 pages).
Inspired by Minsky's Society of Mind, Schmidhuber's Learning to Think, and other more recent works, this paper proposes and advocates for the concept of natural language-based societies of mind (NLSOMs). We imagine these societies as consisting of a collection of multimodal neural networks, including large language models, which engage in a "mindstorm" to solve problems using a shared natural language interface. Here, we work to identify and discuss key questions about the social structure, governance, and economic principles for NLSOMs, emphasizing their impact on the future of AI. Our demonstrations with NLSOMs, which feature up to 129 agents, show their effectiveness in various tasks, including visual question answering, image captioning, and prompt generation for text-to-image synthesis.
Keywords: mindstorm; society of mind (SOM); large language models (LLMs); multimodal learning; learning to think
7. LucIE: Language-guided local image editing for fashion images
Authors: Huanglu Wen, Shaodi You, Ying Fu. Computational Visual Media, 2025, Issue 1, pp. 179-194 (16 pages).
Language-guided fashion image editing is challenging, as fashion image editing is local and requires high precision, while natural language cannot provide precise visual information for guidance. In this paper, we propose LucIE, a novel unsupervised language-guided local image editing method for fashion images. LucIE adopts and modifies a recent text-to-image synthesis network, DF-GAN, as its backbone. However, the synthesis backbone often changes the global structure of the input image, making local image editing impractical. To increase structural consistency between input and edited images, we propose a Content-Preserving Fusion Module (CPFM). Unlike existing fusion modules, CPFM prevents iterative refinement on visual feature maps and accumulates additive modifications on RGB maps. LucIE achieves local image editing explicitly with language-guided image segmentation and mask-guided image blending, while only using image and text pairs. Results on the DeepFashion dataset show that LucIE achieves state-of-the-art results. Compared with previous methods, images generated by LucIE also exhibit fewer artifacts. We provide visualizations and perform ablation studies to validate LucIE and the CPFM, and we also demonstrate and analyze the limitations of LucIE to provide a better understanding of the method.
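The mask-guided image blending that makes the editing explicitly local is simple to state. A minimal sketch, assuming a soft mask in [0, 1] produced by the language-guided segmentation:

```python
import numpy as np

def mask_guided_blend(input_rgb, edited_rgb, mask):
    """Keep the input outside the language-guided segmentation mask and
    take the edited content inside it. mask: (H, W, 1), values in [0, 1]."""
    return mask * edited_rgb + (1.0 - mask) * input_rgb
```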
Keywords: deep learning; language-guided image editing; local image editing; content preservation; fashion images
8. Unified Transformed t-SVD Using Unfolding Tensors for Visual Inpainting
Authors: Mengjie Qin, Wen Wang, Honghui Xu, Te Li, Chunlong Zhang, Minhong Wan. Computational Visual Media, 2025, Issue 3, pp. 549-568 (20 pages).
Low-rank tensor completion (LRTC) restores missing elements in multidimensional visual data; the challenge is representing the inherent structures within this data. Typical methods either suffer from inefficiency owing to the combination of multiple regularizers, or perform suboptimally by using inappropriate priors. In this study, we further investigated LRTC using tensor singular value decomposition (t-SVD). Inspired by the tensor-tensor product (t-product), we propose a unified transformed t-SVD method that employs an invertible linear transform with a unitary transform matrix. However, the t-SVD-based framework lacks the flexibility necessary to represent different inherent relations along the tensor modes. To address this issue, we represent a tensor by a series of multidimensional unfolding tensors to fully explore the hidden structure of the original data. Furthermore, the proposed model can be solved efficiently using the alternating direction method of multipliers (ADMM). Extensive experimental results on multidimensional visual data (multispectral images, hyperspectral images, and videos) demonstrate the superiority of the proposed method over other state-of-the-art LRTC-related methods.
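A transformed t-product of the kind the method builds on can be sketched in a few lines: transform both tensors along the third mode with the invertible matrix L, multiply frontal slices, and transform back. Shapes and naming here are illustrative.

```python
import numpy as np

def transformed_t_product(A, B, L, L_inv):
    """Hedged sketch of a t-product under an invertible transform L along
    mode 3. A: (n1, n2, n3); B: (n2, n4, n3); L, L_inv: (n3, n3)."""
    Ah = np.einsum('ijk,lk->ijl', A, L)            # A x_3 L
    Bh = np.einsum('ijk,lk->ijl', B, L)            # B x_3 L
    Ch = np.einsum('ijk,jlk->ilk', Ah, Bh)         # frontal-slice matrix products
    return np.einsum('ijk,lk->ijl', Ch, L_inv)     # back-transform: C x_3 L^{-1}
```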
Keywords: tensor singular value decomposition (t-SVD); invertible linear transform; unitary transform; unfolding tensors; tensor completion (TC)
9. WDFSR: Normalizing flow based on the wavelet domain for super-resolution
Authors: Chao Song, Shaobang Li, Frederick W. B. Li, Bailin Yang. Computational Visual Media, 2025, Issue 2, pp. 381-404 (24 pages).
We propose a normalizing flow based on the wavelet framework for super-resolution (SR), called WDFSR. It learns the conditional distribution mapping between low-resolution images in the RGB domain and high-resolution images in the wavelet domain to simultaneously generate high-resolution images of different styles. To address the issue that some flow-based models are sensitive to datasets, resulting in training fluctuations that reduce the mapping ability of the model and weaken generalization, we designed a method that combines a T-distribution and a QR decomposition layer. Our method alleviates this problem while maintaining the ability of the model to map different distributions and produce higher-quality images. Because good contextual conditional features promote model training and enhance the distribution mapping capabilities, we also propose a refinement layer combined with an attention mechanism to refine and fuse the extracted condition features to improve image quality. Extensive experiments on several SR datasets demonstrate that WDFSR outperforms most general CNN- and flow-based models in terms of PSNR and perceptual quality. We also demonstrate that our framework works well for other low-level vision tasks, such as low-light enhancement. The pretrained models and source code with guidance for reference are available at https://github.com/Lisbegin/WDFSR.
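As a reminder of what the wavelet-domain target looks like, here is a one-level, unnormalized Haar-style decomposition; the sub-bands LL, LH, HL, and HH are the kind of representation a wavelet-domain SR model would predict for the high-resolution image. WDFSR's actual transform and architecture are documented in the linked repository.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2D Haar-style decomposition (unnormalized averages).
    Returns the LL, LH, HL, HH sub-bands of a (2H, 2W) grayscale image."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0        # vertical average
    d = (img[0::2, :] - img[1::2, :]) / 2.0        # vertical difference
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0           # coarse approximation
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0           # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0           # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0           # diagonal detail
    return LL, LH, HL, HH
```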
Keywords: normalizing flow; super-resolution (SR); wavelet domain; attention mechanism; generative model
10. 3D Indoor Scene Geometry Estimation from a Single Omnidirectional Image: A Comprehensive Survey
Authors: Ming Meng, Yonggui Zhu, Yufei Zhao, Zhaoxin Li, Zhe Zhu. Computational Visual Media, 2025, Issue 3, pp. 431-464 (34 pages).
This paper surveys the technology used in three-dimensional indoor scene geometry estimation from a single 360° omnidirectional image, which is pivotal in extracting 3D structural information from indoor environments. The technology transforms omnidirectional data into a 3D model depicting spatial structure, object positions, and scene layout. Its significance spans various domains, including virtual reality (VR), augmented reality (AR), mixed reality (MR), game development, urban planning, and robot navigation. We begin by revisiting foundational concepts of omnidirectional imaging and detailing the problems, applications, and challenges in this field. Our review categorizes the fundamental tasks of structure recovery, depth estimation, and layout recovery. We also review pertinent datasets and evaluation metrics, providing the latest research as a reference. Finally, we summarize the field and discuss potential future trends to inform and guide further research.
Keywords: 3D scene geometry; omnidirectional images; structure recovery; depth estimation; layout recovery
11. Weakly Supervised Instance Action Recognition
Authors: Haomin Yan, Ruize Han, Wei Feng, Jiewen Zhao, Song Wang. Computational Visual Media, 2025, Issue 3, pp. 603-618 (16 pages).
We study the novel problem of weakly supervised instance action recognition (WSiAR) in multi-person (crowd) scenes. We specifically aim to recognize the action of each subject in the crowd, for which we propose a weakly supervised method, considering the expense of large-scale annotations for training. This problem is of great practical value for video surveillance and sports scene analysis. To this end, we investigated and designed a series of weak annotations for WSiAR supervision. We propose two categories of weak label settings, bag labels and sparse labels, to significantly reduce the number of labels. Based on the former, we propose a novel sub-block-aware multi-instance learning (MIL) loss to obtain more effective information from weak labels during training. For the latter, we propose a pseudo-label generation strategy for extending sparse labels. This enables our method to achieve results comparable to those of fully supervised methods but with significantly fewer annotations. The experimental results on two benchmarks verify the rationality of the problem definition and the effectiveness of the proposed weakly supervised training method.
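A bag-label MIL loss of the general kind described above can be sketched as follows (the paper's sub-block-aware variant is more involved): a bag is positive for an action class if at least one instance performs it, so instance scores are max-pooled before the binary cross-entropy.

```python
import numpy as np

def bag_mil_loss(instance_probs, bag_labels, eps=1e-7):
    """Hedged sketch of a bag-level MIL loss.
    instance_probs: (num_instances, num_classes), per-person action
    probabilities; bag_labels: (num_classes,), binary bag annotations."""
    bag_probs = instance_probs.max(axis=0)         # max-pool over instances
    bce = -(bag_labels * np.log(bag_probs + eps)
            + (1.0 - bag_labels) * np.log(1.0 - bag_probs + eps))
    return bce.mean()
```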
Keywords: weak supervision; instance action recognition; crowd; multi-person scene; human activity
12. Intuitive user-guided portrait image editing with asymmetric conditional GAN
Authors: Linlin Liu, Qian Fu, Fei Hou, Ying He. Computational Visual Media, 2025, Issue 2, pp. 361-379 (19 pages).
We propose PortraitACG, a novel framework for user-guided portrait image editing that leverages an asymmetric conditional generative adversarial network (GAN) and supports fine-grained editing of geometry, colors, lights, and shadows using a single neural network model. Existing conditional GAN-based approaches usually feed the same conditional information into the generators and discriminators, which is sub-optimal because these two modules are designed for different purposes. To facilitate flexible user-guided editing, we propose a novel asymmetric conditional GAN, in which the generators take transformed conditional inputs, such as edge maps, color palettes, sliders, and masks, that can be directly edited by the user, while the discriminators take the conditional inputs in a way that guides controllable image generation more effectively. This allows image editing operations to be performed in a simpler and more intuitive manner. For example, the user can directly use a color palette to specify the desired colors of hair, skin, eyes, lips, and background, and use a slider to blend colors. Moreover, users can edit the lights and shadows by modifying the corresponding masks.
Keywords: portrait images; asymmetric conditional GAN; fine-grained control; color editing; palette
13. Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding
Authors: Yu-Qi Yang, Yu-Xiao Guo, Yang Liu. Computational Visual Media, 2025, Issue 3, pp. 465-481 (17 pages).
Data diversity and abundance are essential for improving the performance and generalization of models in natural language processing and 2D vision. However, the 3D vision domain suffers from a lack of 3D data, and simply combining multiple 3D datasets for pretraining a 3D backbone does not yield significant improvement, due to the domain discrepancies among different 3D datasets that impede effective feature learning. In this work, we identify the main sources of the domain discrepancies between 3D indoor scene datasets, and propose Swin3D++, an enhanced architecture based on Swin3D for efficient pretraining on multi-source 3D point clouds. Swin3D++ introduces domain-specific mechanisms into Swin3D's modules to address domain discrepancies and enhance the network capability for multi-source pretraining. Moreover, we devise a simple source-augmentation strategy to increase the pretraining data scale and facilitate supervised pretraining. We validate the effectiveness of our design, and demonstrate that Swin3D++ surpasses state-of-the-art 3D pretraining methods on typical indoor scene understanding tasks.
Keywords: 3D scenes; indoor; pretraining; multi-source data; data augmentation
14. Script-to-Storyboard: A new contextual retrieval dataset and benchmark
Authors: Xi Tian, Yong-Liang Yang, Qi Wu. Computational Visual Media, 2025, Issue 1, pp. 103-122 (20 pages).
Storyboards comprising key illustrations and images help filmmakers to outline ideas, key moments, and story events when filming movies. Inspired by this, we introduce Script-to-Storyboard (Sc2St), the first contextual benchmark dataset composed of storyboards to explicitly express story structures in the movie domain, and propose the contextual retrieval task to facilitate movie story understanding. Unlike existing movie datasets, the Sc2St dataset contains fine-grained and diverse texts, annotated semantic keyframes, and coherent storylines in storyboards. The contextual retrieval task takes as input a multi-sentence movie script summary with keyframe history, and aims to retrieve a future keyframe, described by a corresponding sentence, to form the storyboard. Compared to classic text-based visual retrieval tasks, this requires capturing context from the description (script) and keyframe history. We benchmark existing text-based visual retrieval methods on the new dataset and propose a recurrent-based framework with three variants for effective context encoding. Comprehensive experiments demonstrate that our methods compare favourably to existing methods; ablation studies validate the effectiveness of the proposed context encoding approaches.
Keywords: dataset; benchmark; text-based image retrieval; movie
15. Noise-robust few-shot classification via variational adversarial data augmentation
Authors: Renjie Xu, Baodi Liu, Kai Zhang, Honglong Chen, Dapeng Tao, Weifeng Liu. Computational Visual Media, 2025, Issue 1, pp. 227-239 (13 pages).
Few-shot classification models trained with clean samples poorly classify real-world samples that carry various scales of noise. To help models recognize noisy samples, researchers usually use data augmentation or train on noisy samples generated by adversarial training. However, existing methods still have problems: (i) the effects of data augmentation on the robustness of the model are limited; (ii) the noise generated by adversarial training usually causes overfitting and reduces the generalization ability of the model, which is particularly significant for few-shot classification; and (iii) most existing methods cannot adaptively generate appropriate noise. Given these three points, this paper proposes a noise-robust few-shot classification algorithm, VADA (Variational Adversarial Data Augmentation). Unlike existing methods, VADA utilizes a variational noise generator to generate an adaptive noise distribution for each sample based on adversarial learning, and optimizes the generator by minimizing the expectation of the empirical risk. Applying VADA during training makes few-shot classification more robust against noisy data while retaining generalization ability. In this paper, we utilize FEAT and ProtoNet as baseline models, and verify accuracy on several common few-shot classification datasets, including MiniImageNet, TieredImageNet, and CUB. After training with VADA, the classification accuracy of the models increases for samples with various scales of noise.
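A variational noise generator with reparameterized sampling, as the abstract describes at a high level, might look like the following hedged sketch; `noise_net` is an illustrative stand-in for the paper's generator, not its actual interface.

```python
import numpy as np

def variational_noise_augment(x, noise_net, rng):
    """Hedged sketch: a variational generator predicts a per-sample noise
    distribution (mu, log_var); sampling uses the reparameterization trick."""
    mu, log_var = noise_net(x)                     # adaptive noise parameters
    eps = rng.standard_normal(x.shape)             # unit Gaussian sample
    return x + mu + np.exp(0.5 * log_var) * eps    # noisy augmented sample
```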
Keywords: few-shot learning; adversarial learning; robustness; variational method
16. A biophysical-based skin model for heterogeneous volume rendering
Authors: Qi Wang, Fujun Luan, Yuxin Dai, Yuchi Huo, Hujun Bao, Rui Wang. Computational Visual Media, 2025, Issue 2, pp. 289-303 (15 pages).
Realistic human skin rendering has been a long-standing challenge in computer graphics. Recently, biophysical-based skin rendering has received increasing attention, as it provides more realistic skin rendering and a more intuitive way to adjust skin style. In this work, we present a novel heterogeneous biophysical-based volume rendering method for human skin that improves the realism of skin appearance while easily simulating various types of skin effects, including skin diseases, by modifying biological coefficient textures. Specifically, we introduce a two-layer skin representation by mesh deformation that explicitly models the epidermis and dermis with heterogeneous volumetric medium layers containing the corresponding spatially varying melanin and hemoglobin, respectively. Furthermore, to better facilitate skin acquisition, we introduce a learning-based framework that automatically estimates spatially varying biological coefficients from an albedo texture, enabling biophysical-based and intuitive editing such as tanning, pathological vitiligo, and freckles. We illustrate the effects of multiple skin-editing applications and demonstrate superior quality to the commonly used random-walk skin-rendering method, with more convincing skin details regarding subsurface scattering.
Keywords: skin model; rendering; ray tracing; biophysics
17. Addressing missing modality challenges in MRI images: A comprehensive review
Authors: Reza Azad, Mohammad Dehghanmanshadi, Nika Khosravi, Julien Cohen-Adad, Dorit Merhof. Computational Visual Media, 2025, Issue 2, pp. 241-268 (28 pages).
Magnetic resonance imaging (MRI) is one of the most prevalent imaging modalities used for diagnosis, treatment planning, and outcome control in various medical conditions. MRI sequences allow physicians to view and monitor tissues at multiple contrasts within a single scan, and serve as input for automated systems performing downstream tasks. However, in clinical practice, there is usually no concise set of identically acquired sequences for a whole group of patients. As a consequence, medical professionals and automated systems both face difficulties due to the lack of complementary information from such missing sequences. This problem is well known in computer vision, particularly in medical image processing tasks such as tumor segmentation, tissue classification, and image generation. With the aim of helping researchers, this literature review examines a significant number of recent approaches that attempt to mitigate these problems. Basic techniques, such as early synthesis methods, as well as later approaches that deploy deep learning, such as common latent space models, knowledge distillation networks, mutual information maximization, and generative adversarial networks (GANs), are examined in detail. We investigate the novelty, strengths, and weaknesses of these strategies. Moreover, using a case study on the segmentation task, our survey offers quantitative benchmarks to further analyze the effectiveness of these methods for addressing the missing-modality challenge. Furthermore, a discussion offers possible future research directions.
Keywords: missing modality; survey; deep learning; magnetic resonance imaging (MRI)
18. MA2Net: Multi-Scale Adaptive Mixed Attention Network for Image Demoiréing
Authors: Ji-Wei Wang, Li-Yong Shen, Hao-Nan Zhao. Computational Visual Media, 2025, Issue 3, pp. 619-634 (16 pages).
Image demoiréing is a complex image-restoration task because of the color and shape variations of moiré patterns. With the development of mobile devices, mobile phones can now capture images at multiple resolutions. This increases the difficulty of removing moiré from both low- and high-resolution images, as the differing resolutions make it challenging for existing methods to match the scales and textures of moiré. To solve these problems, we built a mixed attention residual module (MARM) by combining multi-scale feature extraction and mixed attention methods. Based on MARM, we propose a multi-scale adaptive mixed attention network (MA2Net) that can adapt to input images of different sizes and remove moiré of various shapes. Our model achieved the best results on four public datasets with resolutions ranging from 256×256 to 4K. Extensive experiments demonstrate the effectiveness of our model, which outperforms state-of-the-art methods by a large margin. We also conducted experiments on image deraining to validate the effectiveness of our model in other image-restoration tasks, and MA2Net achieved state-of-the-art performance on the Rain200H dataset.
Keywords: image demoiréing; mixed attention; multi-scale fusion; deep learning
19. Text-to-image generation with bidirectional Multiway Transformers
Authors: Hangbo Bao, Li Dong, Songhao Piao, Furu Wei. Computational Visual Media, 2025, Issue 2, pp. 405-422 (18 pages).
In this study, we explore the potential of Multiway Transformers for text-to-image generation, aiming at performance improvements through a concise and efficient decoupled model design and the inference efficiency provided by bidirectional encoding. We propose a method for improving the image tokenizer using pretrained Vision Transformers. Next, we employ bidirectional Multiway Transformers to restore the masked visual tokens combined with the unmasked text tokens. On the MS-COCO benchmark, our Multiway Transformers outperform vanilla Transformers, achieving superior FID scores and confirming the efficacy of the modality-specific parameter computation design. Ablation studies reveal that the fusion of visual and text tokens in bidirectional encoding contributes to improved model performance. Additionally, our proposed tokenizer outperforms VQGAN in image reconstruction quality and enhances the text-to-image generation results. By incorporating the additional CC-3M dataset for intermediate finetuning of our 688M-parameter model, we achieve competitive results with a finetuned FID score of 4.98 on MS-COCO.
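The masked visual-token objective mentioned above can be illustrated with a short sketch: the model is scored only at masked image-token positions, while text tokens stay unmasked and serve as conditioning inside the model. Shapes and names here are assumptions.

```python
import numpy as np

def masked_token_loss(logits, targets, mask):
    """Hedged sketch of a masked visual-token objective.
    logits: (N, V) scores over the visual vocabulary; targets: (N,)
    ground-truth token ids; mask: (N,) True at masked positions."""
    z = logits - logits.max(axis=-1, keepdims=True)           # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]             # per-token NLL
    return nll[mask].mean()                                   # masked positions only
```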
Keywords: text-to-image generation; VQ-VAE; Transformer; generative models
20. Point Mask Transformer for Outdoor Point Cloud Semantic Segmentation
Authors: Xiangqian Li, Xin Tan, Zhizhong Zhang, Yuan Xie, Lizhuang Ma. Computational Visual Media, 2025, Issue 3, pp. 497-511 (15 pages).
Current outdoor point-cloud segmentation methods typically formulate semantic segmentation as a per-point/voxel classification task. Although this strategy is straightforward, because it classifies each point directly it ignores the overall relationships of the categories. As an alternative paradigm, mask classification decouples category classification from region localization, allowing the model to better capture overall category relationships. In this paper, we propose a novel approach called the point mask transformer (PMFormer), which transforms the semantic segmentation of point clouds from per-point classification to mask classification using a transformer architecture. The proposed model comprises a 3D backbone, a transformer decoder, and a segmentation head that predicts a series of binary masks, each associated with a global class label. Furthermore, to accommodate the unique characteristics of large and sparse outdoor point-cloud scenes, we propose three enhancements for the integration of point-cloud data with the transformer: MaskMix, 3D position encoding, and attention weights. We evaluate our model on the SemanticKITTI and nuScenes datasets. Our experimental results show that the proposed method outperforms state-of-the-art semantic segmentation approaches.
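Mask-classification inference, as opposed to per-point classification, combines each query's binary mask with its global class distribution. A minimal sketch of this standard assembly step follows (shapes are illustrative):

```python
import numpy as np

def masks_to_semantics(mask_logits, class_logits):
    """Hedged sketch of mask-classification inference.
    mask_logits: (Q, N) one binary-mask score per query and point;
    class_logits: (Q, C) one global class score per query."""
    mask_prob = 1.0 / (1.0 + np.exp(-mask_logits))            # sigmoid per mask
    z = class_logits - class_logits.max(axis=-1, keepdims=True)
    cls_prob = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    point_cls = np.einsum('qn,qc->nc', mask_prob, cls_prob)   # aggregate queries
    return point_cls.argmax(axis=-1)                          # label per point
```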
Keywords: point cloud; deep learning; semantic segmentation; Transformer