Journal Articles

484 articles found.
1. Message from the Editors of the CVM 2025 Special Issue
Authors: Piotr Didyk, Junhui Hou. Computational Visual Media, 2025, Issue 4, pp. 675–676.
The Computational Visual Media (CVM) conference series provides a leading international forum for the exchange of innovative research ideas and significant computational methodologies that both underpin and advance visual media. Its primary mission is to foster cross-disciplinary research that integrates computer graphics, computer vision, machine learning, image and video processing, visualization, and geometric computing. Topics of particular interest include classification, composition, retrieval, synthesis, cognition, and understanding of visual media, encompassing images, video, and 3D geometry.
Keywords: computer graphics; computer vision; machine learning; image processing; visual media; geometric computing; computational methodologies; CVM conference
2. Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding (cited: 1)
Authors: Yu-Qi Yang, Yu-Xiao Guo, Yang Liu. Computational Visual Media, 2025, Issue 3, pp. 465–481.
Data diversity and abundance are essential for improving the performance and generalization of models in natural language processing and 2D vision. However, the 3D vision domain suffers from a lack of 3D data, and simply combining multiple 3D datasets for pretraining a 3D backbone does not yield significant improvement, due to the domain discrepancies among different 3D datasets that impede effective feature learning. In this work, we identify the main sources of the domain discrepancies between 3D indoor scene datasets, and propose Swin3D++, an enhanced architecture based on Swin3D for efficient pretraining on multi-source 3D point clouds. Swin3D++ introduces domain-specific mechanisms to Swin3D's modules to address domain discrepancies and enhance the network capability on multi-source pretraining. Moreover, we devise a simple source-augmentation strategy to increase the pretraining data scale and facilitate supervised pretraining. We validate the effectiveness of our design, and demonstrate that Swin3D++ surpasses the state-of-the-art 3D pretraining methods on typical indoor scene understanding tasks.
Keywords: 3D scenes; indoor; pretraining; multi-source data; data augmentation
3. Heuristic weakly supervised 3D human pose estimation
Authors: Shuangjun Liu, Michael Wan, Sarah Ostadabbas. Computational Visual Media, 2025, Issue 6, pp. 1399–1406.
Estimating 3D human pose from 2D images in real-world contexts remains a challenge, characterized by unique data constraints. Large general datasets of motion-captured 3D adult human poses paired with 2D images exist, but in many application settings, collection of further motion-captured data is impossible, precluding a straightforward fine-tuning approach to adaptation. We present a method for improving 3D pose estimation transfer learning to domains where there are only depth camera images available as supervision. Our heuristic weakly supervised 3D human pose (HW-HuP) estimation method learns partial pose priors from general 3D human pose datasets and employs weak supervision with depth data to guide learning in an optimization and regression cycle. We show that HW-HuP meaningfully improves upon state-of-the-art models in the adult in-bed setting, as well as on large-scale public 3D human pose datasets, under comparable supervision conditions. Our model code and data are publicly available at https://github.com/ostadabbas/hw-hup. A significantly expanded version of this paper, with supplementary material, is available as a preprint on arXiv at https://arxiv.org/abs/2105.10996.
Keywords: depth data; depth camera images; 2D images; optimization; 3D human pose estimation; heuristic; transfer learning
4. IIDM: Image-to-image diffusion model for semantic image synthesis
Authors: Feng Liu, Xiaobin Chang. Computational Visual Media, 2025, Issue 2, pp. 423–429.
Semantic image synthesis aims to generate high-quality images given semantic conditions, i.e., segmentation masks and style reference images. Existing methods widely adopt generative adversarial networks (GANs). GANs take all conditional inputs and directly synthesize images in a single forward step. In this paper, semantic image synthesis is treated as an image denoising task and is handled with a novel image-to-image diffusion model (IIDM).
Keywords: generative adversarial networks (GANs); semantic image synthesis; style reference images; image denoising; image-to-image diffusion model
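The abstract's key move is to treat synthesis as iterative denoising rather than a single GAN forward pass. As a hedged illustration only (a generic DDPM-style reverse step, not IIDM's actual network; the conditioning on segmentation masks and style images is omitted):

```python
import numpy as np

def ddpm_reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, noise):
    """One standard DDPM reverse (denoising) step: subtract the predicted
    noise component from x_t, rescale, and add fresh noise scaled by sigma_t."""
    coef = (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t)
    mean = (x_t - coef * eps_pred) / np.sqrt(alpha_t)
    return mean + sigma_t * noise

# Toy run: forward-noise a clean "image", then take one reverse step
# using the exact noise as a stand-in for the network's prediction.
rng = np.random.default_rng(0)
x0 = np.ones((4, 4, 3))
alpha_t, alpha_bar_t = 0.9, 0.5
eps = rng.normal(size=x0.shape)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
x_prev = ddpm_reverse_step(x_t, eps, alpha_t, alpha_bar_t, 0.0, 0.0)
```

Repeating this step across a noise schedule, with the noise predictor conditioned on the semantic inputs, is what replaces the GAN's single forward pass.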
5. SAD: Style-aware diffusion adaptation for few-shot style transfer image generation
Authors: Yilong Liu, Houwen Zheng, Shuojin Yang. Computational Visual Media, 2025, Issue 4, pp. 889–895.
Diffusion-based models have recently achieved remarkable success in style transfer. However, when training data is scarce, existing methods struggle to effectively balance style and content. In this paper, we propose Style-Aware Diffusion (SAD), a novel method that harnesses efficient low-rank adaptation training techniques. Specifically, we extract latent representations of both style and content using DDIM inversion, formulated as an ordinary differential equation. Then, we use adaptive instance normalization and query–key–value injection to effectively align low-level style features with high-level content semantics. In addition, we propose parameter-efficient adaptation, which mitigates catastrophic forgetting and overfitting by rationally optimizing the weights of the attention layers, ensuring robust and effective performance, and achieving a 61.5% relative score increase over the plain model. The proposed method outperforms the high-performance DreamBooth-LoRA model and won the Fourth Jittor Artificial Intelligence Challenge. Our model is implemented using the Jittor framework and is available at https://github.com/liylo/jittor-qwqw-Few_Shot_Style_Transfer.
Keywords: DDIM inversion; adaptive instance normalization; ordinary differential equation; few-shot style transfer; style-aware diffusion
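Adaptive instance normalization, one of the ingredients the abstract lists, has a standard closed form: re-standardize the content features per channel, then impose the style features' per-channel mean and standard deviation. A minimal numpy sketch of generic AdaIN (not SAD's specific injection scheme):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: whiten content features channel-wise,
    then apply the style's per-channel mean and std.
    Arrays are (H, W, C); statistics are taken over the spatial axes."""
    c_mu = content.mean(axis=(0, 1), keepdims=True)
    c_std = content.std(axis=(0, 1), keepdims=True)
    s_mu = style.mean(axis=(0, 1), keepdims=True)
    s_std = style.std(axis=(0, 1), keepdims=True)
    return s_std * (content - c_mu) / (c_std + eps) + s_mu

rng = np.random.default_rng(0)
content = rng.normal(2.0, 3.0, size=(8, 8, 4))
style = rng.normal(-1.0, 0.5, size=(8, 8, 4))
out = adain(content, style)
```

After the call, `out` carries the content's spatial structure with the style's channel statistics, which is the sense in which low-level style is aligned with content.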
6. Uncertainty-aware multiple view stereo network with accurate supervision
Authors: Xincheng Tang, Mengqi Rong, Bin Fan, Hongmin Liu, Shuhan Shen. Computational Visual Media, 2025, Issue 5, pp. 1133–1139.
Learning-based multiple view stereo has gained significant attention recently. However, most methods rely on direct network supervision using provided ground-truth depth, which poses three inherent problems: resolution-dependent ground-truth artifacts, excessively challenging training examples (with relatively featureless textures), and use of less-viewed reference pixels for supervision, all of which hinder network optimization. To alleviate these problems, we propose an accurate network supervision paradigm that includes a ground-truth mask, an entropy mask, and a consistency mask, which provide more accurate supervision signals to aid network optimization.
Keywords: multiple view stereo; uncertainty; ground-truth mask; entropy mask; consistency mask; network supervision; network optimization
7. FCDFusion: A fast, low color deviation method for fusing visible and infrared image pairs
Authors: Hesong Li, Ying Fu. Computational Visual Media, 2025, Issue 1, pp. 195–211.
Visible and infrared image fusion (VIF) aims to combine information from visible and infrared images into a single fused image. Previous VIF methods usually employ a color space transformation to keep the hue and saturation from the original visible image. However, for fast VIF methods, this operation accounts for the majority of the calculation and is the bottleneck preventing faster processing. In this paper, we propose a fast fusion method, FCDFusion, with little color deviation. It preserves color information without color space transformations, by directly operating in RGB color space. It incorporates gamma correction at little extra cost, allowing color and contrast to be rapidly improved. We regard the fusion process as a scaling operation on 3D color vectors, greatly simplifying the calculations. A theoretical analysis and experiments show that our method can achieve satisfactory results in only 7 FLOPs per pixel. Compared to state-of-the-art fast, color-preserving methods using HSV color space, our method provides higher contrast at only half of the computational cost. We further propose a new metric, color deviation, to measure the ability of a VIF method to preserve color. It is specifically designed for VIF tasks with color visible-light images, and overcomes deficiencies of existing VIF metrics used for this purpose. Our code is available at https://github.com/HeasonLee/FCDFusion.
Keywords: infrared images; visible and infrared image fusion (VIF); gamma correction; real-time display; color metrics; color deviation
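The abstract's central idea is that fusion can be a scaling operation on 3D color vectors in RGB space, which preserves hue and saturation without any color-space conversion. A hedged toy version of that idea follows; the 50/50 intensity blend and the channel-mean luminance proxy are assumptions for illustration, not FCDFusion's actual 7-FLOP rule:

```python
import numpy as np

def scale_fuse(visible, infrared, eps=1e-6):
    """Toy fusion-as-scaling: keep each visible pixel's RGB direction
    (hence hue/saturation) and rescale its length so that its intensity
    matches a blend of visible and infrared intensity."""
    vis_y = visible.mean(axis=-1, keepdims=True)       # crude luminance proxy
    target = 0.5 * vis_y + 0.5 * infrared[..., None]   # blended intensity
    scale = target / (vis_y + eps)                     # per-pixel scalar
    return np.clip(visible * scale, 0.0, 1.0)

vis = np.random.default_rng(1).uniform(0.1, 0.9, size=(4, 4, 3))
ir = np.random.default_rng(2).uniform(0.0, 1.0, size=(4, 4))
fused = scale_fuse(vis, ir)
```

Because each pixel is multiplied by a single scalar, the ratios between its R, G, and B components (and hence its chromaticity) are unchanged wherever clipping does not kick in.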
8. PASS-SAM: Integration of Segment Anything Model for Large-Scale Unsupervised Semantic Segmentation
Authors: Yin Tang, Rui Chen, Gensheng Pei, Qiong Wang. Computational Visual Media, 2025, Issue 3, pp. 669–674.
Large-scale unsupervised semantic segmentation (LUSS) is a sophisticated process that aims to segment similar areas within an image without relying on labeled training data. While existing methodologies have made substantial progress in this area, there is ample scope for enhancement. We thus introduce the PASS-SAM model, a comprehensive solution that amalgamates the benefits of various models to improve segmentation performance.
Keywords: large-scale unsupervised semantic segmentation; Segment Anything Model; PASS-SAM; segmentation performance
9. Gaussian-plus-SDF SLAM: High-fidelity 3D reconstruction at 150+ fps
Authors: Zhexi Peng, Kun Zhou, Tianjia Shao. Computational Visual Media, 2025, Issue 6, pp. 1195–1208.
While recent Gaussian-based SLAM methods achieve photorealistic reconstruction from RGB-D data, their computational performance remains a critical bottleneck. State-of-the-art techniques operate at less than 20 fps, significantly lagging behind geometry-based approaches like KinectFusion (hundreds of fps). This limitation stems from the heavy computational burden: modeling scenes requires numerous Gaussians and complex iterative optimization to fit RGB-D data; insufficient Gaussian counts or optimization iterations cause severe quality degradation. To address this, we propose a Gaussian-SDF hybrid representation, combining a colorized signed distance field (SDF) for smooth geometry and appearance with 3D Gaussians to capture underrepresented details. The SDF is efficiently constructed via RGB-D fusion (as in geometry-based methods), while Gaussians undergo iterative optimization. Our representation enables significant Gaussian reduction (50% fewer) by avoiding full-scene Gaussian modeling, and efficient Gaussian optimization (75% fewer iterations) through targeted appearance refinement. Building upon this representation, we develop GPS-SLAM (Gaussian-plus-SDF SLAM), a real-time 3D reconstruction system achieving over 150 fps on real-world Azure Kinect sequences, faster by an order of magnitude than state-of-the-art techniques while maintaining comparable reconstruction quality. The source code and data are available at https://gapszju.github.io/GPS-SLAM.
Keywords: 3D reconstruction; Gaussian splatting; SLAM; RGB-D cameras; 3D scanning; signed distance fields
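The SDF half of the hybrid is built by classic RGB-D fusion, which in KinectFusion-style systems is a per-voxel weighted running average of depth observations. A generic sketch of that integration step (standard TSDF fusion, not the authors' code):

```python
import numpy as np

def tsdf_integrate(tsdf, weight, new_sdf, new_weight):
    """KinectFusion-style SDF fusion: each voxel's signed distance is the
    weight-averaged value of all observations integrated so far."""
    total = weight + new_weight
    fused = np.where(
        total > 0,
        (tsdf * weight + new_sdf * new_weight) / np.maximum(total, 1e-12),
        tsdf,
    )
    return fused, total

# Two equal-weight observations of a voxel grid fuse to their mean.
grid = np.zeros((2, 2, 2))
w = np.zeros((2, 2, 2))
grid, w = tsdf_integrate(grid, w, np.full((2, 2, 2), 0.4), np.ones((2, 2, 2)))
grid, w = tsdf_integrate(grid, w, np.full((2, 2, 2), 0.2), np.ones((2, 2, 2)))
```

Because this update is a closed-form average rather than an iterative optimization, it is the cheap path that lets the paper reserve Gaussians (and their optimization budget) for details only.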
10. Continuous indexed points for multivariate volume visualization
Authors: Liang Zhou, Xinyi Gou, Daniel Weiskopf. Computational Visual Media, 2025, Issue 6, pp. 1303–1328.
We introduce continuous indexed points for improved multivariate volume visualization. Indexed points represent linear structures in parallel coordinates and can be used to encode local correlation of multivariate (including multi-field, multifaceted, and multi-attribute) volume data. First, we perform local linear fitting in the spatial neighborhood of each volume sample using principal component analysis, accelerated by hierarchical spatial data structures. This local linear information is then visualized as continuous indexed points in parallel coordinates: a density representation of indexed points in a continuous domain. With our new method, multivariate volume data can be analyzed using eigenvector information from local spatial embeddings. We utilize both 1-flat and 2-flat indexed points, allowing us to identify correlations between two variables and even three variables, respectively. An interactive occlusion shading model facilitates good spatial perception of the volume rendering of volumetric correlation characteristics. Interactive exploration is supported by specifically designed multivariate transfer function widgets working in the image plane of parallel coordinates. We show that our generic technique works for multi-attribute datasets. The effectiveness and usefulness of our new method are demonstrated through a case study, an expert user study, and domain expert feedback.
Keywords: volume visualization; multivariate volumes; multi-field; correlation; indexed points; parallel coordinates
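The first step the abstract describes, local linear fitting per spatial neighborhood via principal component analysis, reduces to an eigendecomposition of the neighborhood's covariance matrix: the dominant eigenvector gives the best-fit line (1-flat) and the top two span the best-fit plane (2-flat). A generic sketch (the hierarchical acceleration structures and the parallel-coordinates mapping are omitted):

```python
import numpy as np

def local_pca(samples):
    """PCA of a local neighborhood: eigenvalues (descending) and matching
    eigenvectors of the covariance of (n_points, n_variables) samples."""
    centered = samples - samples.mean(axis=0)
    cov = centered.T @ centered / (len(samples) - 1)
    evals, evecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Toy neighborhood: two variables almost perfectly correlated, so the
# first eigenvalue dominates and a 1-flat captures the relation.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
pts = np.stack([t, 2.0 * t + 0.01 * rng.normal(size=200)], axis=1)
evals, evecs = local_pca(pts)
```

A large gap between the first and second eigenvalues is exactly the "local linear information" that gets encoded as an indexed point.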
11. Swin3D: A pretrained transformer backbone for 3D indoor scene understanding
Authors: Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, Baining Guo. Computational Visual Media, 2025, Issue 1, pp. 83–101.
The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validated the scalability, generality, and superior performance enabled by our approach.
Keywords: 3D pretraining; point cloud analysis; transformer backbone; Swin Transformer; 3D semantic segmentation; 3D object detection
12. ARNet: Attribute artifact reduction for G-PCC compressed point clouds
Authors: Junzhe Zhang, Junteng Zhang, Dandan Ding, Zhan Ma. Computational Visual Media, 2025, Issue 2, pp. 327–342.
A learning-based adaptive loop filter is developed for the geometry-based point-cloud compression (G-PCC) standard to reduce attribute compression artifacts. The proposed method first generates multiple most probable sample offsets (MPSOs) as potential compression distortion approximations, and then linearly weights them for artifact mitigation. Therefore, we drive the filtered reconstruction as close to the uncompressed PCA as possible. To this end, we devise an attribute artifact reduction network (ARNet) consisting of two consecutive processing phases: MPSO derivation and MPSO combination. The MPSO derivation uses a two-stream network to model local neighborhood variations from direct spatial embedding and frequency-dependent embedding, where sparse convolutions are utilized to best aggregate information from sparsely and irregularly distributed points. The MPSO combination is guided by the least-squares error metric to derive weighting coefficients on the fly to further capture the content dynamics of the input PCAs. ARNet is implemented as an in-loop filtering tool for G-PCC, where the linear weighting coefficients are encapsulated into the bitstream with negligible bitrate overhead. The experimental results demonstrate significant improvements over the latest G-PCC both subjectively and objectively. For example, our method offers a 22.12% YUV Bjøntegaard delta rate (BD-Rate) reduction compared to G-PCC across various commonly used test point clouds. Compared with a recent study showing state-of-the-art performance, our work not only gains 13.23% YUV BD-Rate but also provides a 30× processing speedup.
Keywords: point cloud attribute compression; sparse convolution; sample offset; linear coefficient
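The MPSO combination stage is guided by a least-squares error metric to obtain weighting coefficients that are then signaled in the bitstream. The setup below is hypothetical (the candidate count, data, and names are invented for illustration), but the closed-form weighting itself is a plain linear least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder-side setup: K candidate offset signals (MPSOs) per
# point, and the true distortion residual known at encode time.
K, N = 3, 500
mpsos = rng.normal(size=(N, K))            # candidate offsets per point
true_w = np.array([0.6, 0.3, 0.1])         # unknown "ideal" combination
residual = mpsos @ true_w + 0.01 * rng.normal(size=N)

# Least-squares weighting coefficients, as the abstract describes;
# these few scalars are what would be written into the bitstream.
w, *_ = np.linalg.lstsq(mpsos, residual, rcond=None)
corrected = residual - mpsos @ w           # remaining artifact after filtering
```

Signaling only `w` (a handful of scalars) is why the bitrate overhead of the in-loop filter is negligible.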
13. Mindstorms in natural language-based societies of mind
Authors: Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya A. Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanić, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, Jürgen Schmidhuber. Computational Visual Media, 2025, Issue 1, pp. 29–81.
Inspired by Minsky's Society of Mind, Schmidhuber's Learning to Think, and other more recent works, this paper proposes and advocates for the concept of natural language-based societies of mind (NLSOMs). We imagine these societies as consisting of a collection of multimodal neural networks, including large language models, which engage in a "mindstorm" to solve problems using a shared natural language interface. Here, we work to identify and discuss key questions about the social structure, governance, and economic principles for NLSOMs, emphasizing their impact on the future of AI. Our demonstrations with NLSOMs, which feature up to 129 agents, show their effectiveness in various tasks, including visual question answering, image captioning, and prompt generation for text-to-image synthesis.
Keywords: mindstorm; society of mind (SOM); large language models (LLMs); multimodal learning; learning to think
14. Multi-task gradual inference with a single encoder–decoder network for automatic portrait matting
Authors: Wenbing Yang, Wei Ma, Qing Mi, Hongbin Zha. Computational Visual Media, 2025, Issue 6, pp. 1385–1398.
This paper presents a multi-task gradual inference model, MTGINet, for automatic portrait matting. It handles the subtasks of automatic portrait matting, namely portrait–transition–background trimap segmentation and transition region matting, with a single encoder–decoder structure. First, we enrich the highest stage of features from the encoder with portrait shape context via a shape context aggregation (SCA) module for trimap segmentation. Then, we fuse the SCA-enhanced features with detailed clues from the encoder for transition-region-aware alpha matting. The gradual inference model naturally allows sufficient interaction between the subtasks via forward computation and backward propagation during training, and therefore achieves high accuracy while maintaining low complexity. In addition, considering the discrepancies in feature requirements across subtasks, we adapt the features from the encoder before reusing them via a feature rectification module. Beyond the MTGINet model, we have constructed a new large-scale dataset, HPM-17K, for half-body portrait matting. It consists of 16,967 images with diverse backgrounds. Comparative experiments with existing deep models on the public P3M-10K dataset and our HPM-17K dataset demonstrate that the proposed model exhibits state-of-the-art performance.
Keywords: portrait matting; alpha matting; multi-task gradual inference; background replacement; encoder–decoder network
15. LucIE: Language-guided local image editing for fashion images
Authors: Huanglu Wen, Shaodi You, Ying Fu. Computational Visual Media, 2025, Issue 1, pp. 179–194.
Language-guided fashion image editing is challenging, as fashion image editing is local and requires high precision, while natural language cannot provide precise visual information for guidance. In this paper, we propose LucIE, a novel unsupervised language-guided local image editing method for fashion images. LucIE adopts and modifies a recent text-to-image synthesis network, DF-GAN, as its backbone. However, the synthesis backbone often changes the global structure of the input image, making local image editing impractical. To increase structural consistency between input and edited images, we propose a Content-Preserving Fusion Module (CPFM). Different from existing fusion modules, CPFM prevents iterative refinement on visual feature maps and accumulates additive modifications on RGB maps. LucIE achieves local image editing explicitly with language-guided image segmentation and mask-guided image blending while only using image and text pairs. Results on the DeepFashion dataset show that LucIE achieves state-of-the-art performance. Compared with previous methods, images generated by LucIE also exhibit fewer artifacts. We provide visualizations and perform ablation studies to validate LucIE and the CPFM. We also demonstrate and analyze the limitations of LucIE, to provide a better understanding of the method.
Keywords: deep learning; language-guided image editing; local image editing; content preservation; fashion images
16. Unified Transformed t-SVD Using Unfolding Tensors for Visual Inpainting
Authors: Mengjie Qin, Wen Wang, Honghui Xu, Te Li, Chunlong Zhang, Minhong Wan. Computational Visual Media, 2025, Issue 3, pp. 549–568.
Low-rank tensor completion (LRTC) restores missing elements in multidimensional visual data; the challenge is representing the inherent structures within this data. Typical methods either suffer from inefficiency owing to the combination of multiple regularizers or perform suboptimally using inappropriate priors. In this study, we further investigated LRTC using tensor singular value decomposition (t-SVD). Inspired by the tensor-tensor product (t-product), we proposed a unified transformed t-SVD method that employs an invertible linear transform with a unitary transform matrix. However, the t-SVD-based framework lacks the flexibility necessary to represent different inherent relations along the tensor modes. To address this issue, we propose a tensor represented by a series of multidimensional unfolding tensors to fully explore the hidden structure of the original data. Furthermore, the proposed model can be solved efficiently using the alternating direction method of multipliers (ADMM). Extensive experimental results on multidimensional visual data (multispectral images, hyperspectral images, and videos) demonstrated the superiority of the proposed method over other state-of-the-art LRTC-related methods.
Keywords: tensor singular value decomposition (t-SVD); invertible linear transform; unitary transform; unfolding tensors; tensor completion (TC)
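The t-SVD framework rests on the tensor-tensor product (t-product), which becomes slice-wise matrix multiplication after a transform along the third mode; the transformed t-SVD the abstract describes generalizes this by replacing the DFT with other invertible or unitary transforms. A minimal DFT-based t-product, for reference only:

```python
import numpy as np

def t_product(A, B):
    """t-product of third-order tensors via the DFT along mode 3:
    transform, multiply the frontal slices frequency-by-frequency,
    and invert the transform. A: (m, k, n3), B: (k, p, n3) -> (m, p, n3)."""
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    Cf = np.einsum('ikt,kjt->ijt', Af, Bf)   # slice-wise matmul per frequency
    return np.real(np.fft.ifft(Cf, axis=2))

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4, 5))

# The t-product identity tensor: identity matrix in the first frontal
# slice, zeros elsewhere, so A * I == A.
I = np.zeros((4, 4, 5))
I[:, :, 0] = np.eye(4)
C = t_product(A, I)
```

Swapping `np.fft.fft`/`ifft` for any invertible transform pair along mode 3 yields a transformed t-product, which is the degree of freedom the unified method exploits.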
17. WDFSR: Normalizing flow based on the wavelet domain for super-resolution
Authors: Chao Song, Shaobang Li, Frederick W. B. Li, Bailin Yang. Computational Visual Media, 2025, Issue 2, pp. 381–404.
We propose a normalizing flow based on the wavelet framework for super-resolution (SR), called WDFSR. It learns the conditional distribution mapping between low-resolution images in the RGB domain and high-resolution images in the wavelet domain to simultaneously generate high-resolution images of different styles. To address the issue of some flow-based models being sensitive to datasets, which results in training fluctuations that reduce the mapping ability of the model and weaken generalization, we designed a method that combines a T-distribution and QR decomposition layer. Our method alleviates this problem while maintaining the ability of the model to map different distributions and produce higher-quality images. Good contextual conditional features can promote model training and enhance the distribution mapping capabilities for conditional distribution mapping. Therefore, we propose a Refinement layer combined with an attention mechanism to refine and fuse the extracted condition features to improve image quality. Extensive experiments on several SR datasets demonstrate that WDFSR outperforms most general CNN- and flow-based models in terms of PSNR value and perception quality. We also demonstrated that our framework works well for other low-level vision tasks, such as low-light enhancement. The pretrained models and source code with guidance for reference are available at https://github.com/Lisbegin/WDFSR.
Keywords: normalizing flow; super-resolution (SR); wavelet domain; attention mechanism; generative model
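As background for the wavelet-domain target space, a single level of a (non-normalized, averaging) 2D Haar decomposition splits an image into low-frequency (LL) and detail (LH, HL, HH) sub-bands. This is generic background only, not WDFSR's specific transform choice:

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2D Haar decomposition (averaging convention) for an
    image with even height and width: rows are split into averages and
    differences, then the same is done along columns."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row averages
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row differences
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0      # low/low sub-band
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0      # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0      # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0      # diagonal detail
    return LL, LH, HL, HH

img = np.arange(16.0).reshape(4, 4)
LL, LH, HL, HH = haar_dwt2(img)
```

Predicting the HR image as a set of such sub-bands, rather than raw RGB, is what "high-resolution images in the wavelet domain" refers to.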
18. Remote sensing tuning: A survey
Authors: Dongshuo Yin, Ting-Feng Zhao, Deng-Ping Fan, Shutao Li, Bo Du, Xian Sun, Shi-Min Hu. Computational Visual Media, 2025, Issue 5, pp. 897–937.
Large models have accelerated the development of intelligent interpretation in remote sensing. Many remote sensing foundation models (RSFMs) have emerged in recent years, sparking a new wave of deep learning in this field. Fine-tuning techniques serve as a bridge between remote sensing downstream tasks and advanced foundation models. As RSFMs become more powerful, fine-tuning techniques are expected to lead the next research frontier in numerous critical remote sensing applications. Advanced fine-tuning techniques can reduce the data and computational resource requirements during the downstream adaptation process. Current fine-tuning techniques for remote sensing are still in their early stages, leaving a large space for optimization and application. To elucidate the current development and future trends of remote sensing fine-tuning techniques, this survey offers a comprehensive overview of recent research. Specifically, this survey summarizes the applications and innovations of each work and categorizes recent remote sensing fine-tuning techniques into six types: adapter-based, prompt-based, reparameterization-based, hybrid methods, partial tuning, and improved tuning.
Keywords: remote sensing; deep learning; foundation models; fine-tuning; pre-training
19. MagicTalk: Implicit and explicit correlation learning for diffusion-based emotional talking face generation
Authors: Chenxu Zhang, Chao Wang, Jianfeng Zhang, Hongyi Xu, Guoxian Song, You Xie, Linjie Luo, Yapeng Tian, Jiashi Feng, Xiaohu Guo. Computational Visual Media, 2025, Issue 4, pp. 763–779.
Generating emotional talking faces from a single portrait image remains a significant challenge. The simultaneous achievement of expressive emotional talking and accurate lip-sync is particularly difficult, as expressiveness is often compromised for lip-sync accuracy. Prevailing generative works usually struggle to jointly generate subtle variations of emotional expression and lip-synchronized talking. To address these challenges, we suggest modeling the implicit and explicit correlations between audio and emotional talking faces with a unified framework. As human emotional expressions usually present subtle and implicit relations with speech audio, we propose incorporating audio and emotional style embeddings into the diffusion-based generation process, for realistic generation while concentrating on emotional expressions. We then propose lip-based explicit correlation learning to construct a strong mapping of audio to lip motions, assuring lip-audio synchronization. Furthermore, we deploy a video-to-video rendering module to transfer expressions and lip motions from a proxy 3D avatar to an arbitrary portrait. Both quantitatively and qualitatively, MagicTalk outperforms state-of-the-art methods in terms of expressiveness, lip-sync, and perceptual quality.
Keywords: emotions; talking face generation; diffusion model; implicit and explicit correlation learning
20. ImVoxelENet: Image to voxels epipolar transformer for multi-view RGB-based 3D object detection
Authors: Gang Xu, Haoyu Liu, Biao Leng, Zhang Xiong. Computational Visual Media, 2025, Issue 4, pp. 871–888.
The task of detecting three-dimensional objects using only RGB images presents a considerable challenge within the domain of computer vision. The core issue lies in accurately performing epipolar geometry matching between multiple views to obtain latent geometric priors. Existing methods establish correspondences along epipolar line features in voxel space through various layers of convolution. However, this step often occurs in the later stages of the network, which limits overall performance. To address this challenge, we introduce a novel framework, ImVoxelENet, that integrates a geometric epipolar constraint. We start from the back-projection of pixel-wise features and design an attention mechanism that captures the relationship between forward and backward features along the ray for multiple views. This approach enables the early establishment of geometric correspondences and structural connections between epipolar lines. Using ScanNetV2 as a benchmark, extensive comparative and ablation experiments demonstrate that our proposed network achieves a 1.1% improvement in mAP, highlighting its effectiveness in enhancing 3D object detection performance. Our code is available at https://github.com/xug-coder/ImVoxelENet.
Keywords: 3D object detection; epipolar geometry; transformers; attention; deep learning