Simultaneous localization and mapping (SLAM) refers to an autonomous mobile robot concurrently estimating its own pose and building a map of an unknown environment, a capability of central importance to robotics and autonomous driving. This paper first reviews the evolution of SLAM, from early hand-crafted feature extraction methods to modern deep-learning-driven solutions. Among the latter, SLAM methods based on neural radiance fields (NeRF) use neural networks for scene representation and further improve the visual quality of the reconstructed map; however, their slow rendering still limits real-time applicability. By contrast, SLAM methods based on Gaussian splatting (GS) offer real-time rendering speed and photorealistic scene rendering, bringing new research directions and opportunities to the field. We then categorize and summarize Gaussian-splatting-based SLAM methods by three application types (RGB/RGB-D input, multimodal data, and semantic information) and discuss the strengths and limitations of the methods in each category. Finally, we analyze the open problems facing current Gaussian-splatting-based SLAM, including real-time performance, benchmark standardization, scalability to large scenes, and catastrophic forgetting, and outline promising directions for future research. Through this discussion and analysis, we aim to give researchers and engineers in the SLAM community a comprehensive perspective on the key issues confronting current SLAM systems and to help advance the field's techniques and applications.
Abstract: [Objective] To achieve accurate fruit segmentation in complex orchard environments. [Method] This paper proposes a novel method for 3D reconstruction and fruit semantic segmentation of citrus trees. First, neural radiance field (NeRF) techniques are used to learn an implicit 3D representation of a fruit tree from multi-view images and generate a high-quality point cloud model of the tree. Then, an improved random local point cloud feature aggregation network (RandLA-Net) performs end-to-end semantic segmentation of the tree point cloud and accurately extracts the fruit points. The targeted improvements to RandLA-Net add a bilateral enhancement module after the encoder layers and adopt a loss function better suited to the fruit point-cloud segmentation task; the improved segmentation network is validated experimentally on a citrus tree dataset. [Result] The proposed method effectively reconstructs the 3D structure of fruit trees; the improved network raises the mean intersection over union by 2.64 percentage points and the fruit IoU by 7.33 percentage points, confirming the method's practicality in smart-orchard settings. [Conclusion] This work provides new technical support for intelligent orchard management and automated harvesting.
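The abstract does not specify the loss function adopted for the improved RandLA-Net; a class-weighted cross-entropy, which counteracts the scarcity of fruit points relative to foliage and branch points, is one plausible choice. The sketch below is purely illustrative, and the two-class layout and example class counts are assumptions.

```python
# Illustrative class-weighted loss for imbalanced fruit/non-fruit point segmentation.
# The paper does not state its loss; weighted cross-entropy is shown here purely
# as an assumption about what "better suited to fruit segmentation" could mean.
import torch
import torch.nn as nn

def make_weighted_point_loss(class_counts):
    """Build a cross-entropy loss whose per-class weights are inversely
    proportional to class frequency (fruit points are typically rare)."""
    counts = torch.tensor(class_counts, dtype=torch.float32)
    weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weights
    return nn.CrossEntropyLoss(weight=weights)

# Hypothetical example: 1,000,000 foliage/branch points vs. 50,000 fruit points.
criterion = make_weighted_point_loss([1_000_000, 50_000])
logits = torch.randn(4096, 2)           # per-point class scores from the network
labels = torch.randint(0, 2, (4096,))   # ground-truth labels (0 = other, 1 = fruit)
loss = criterion(logits, labels)
```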
Funding: supported in part by the Natural Science Foundation of Shandong Province (ZR2024ZD12).
Abstract: Talking head generation based on neural radiance fields (NeRF) has gained prominence, primarily owing to the implicit 3D representation capability of neural networks. However, most NeRF-based methods intertwine audio-to-video conversion in a joint training process, resulting in challenges such as inadequate lip synchronization, limited learning efficiency, large memory requirements, and a lack of editability. In response to these issues, this paper introduces a fully decoupled NeRF-based method for generating talking heads. The method separates audio-to-video conversion into two stages through the use of facial landmarks. Notably, a Transformer network is used to establish the cross-modal connection between audio and landmarks and to generate landmarks conforming to the distribution of the training data. We also explore formant features of the audio as additional conditions to guide landmark generation. These landmarks are then combined with Gaussian relative position coding to refine the sampling points on the rays, thereby constructing a dynamic NeRF conditioned on the landmarks and audio features for rendering the generated head. This decoupled setup enhances both the fidelity and the flexibility of mapping audio to video with two independent small-scale networks. Additionally, it supports generating the torso from a head-only image with an improved StyleUnet, further enhancing the realism of the generated talking head. Our experimental results demonstrate that the method excels at producing lifelike talking heads, and that the lightweight network models also exhibit superior speed and learning efficiency with lower memory requirements.
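As a rough illustration of the first stage described above (audio plus formant features mapped to facial landmarks by a Transformer), the following PyTorch sketch shows the general shape of such a model. All layer sizes, the 4-dimensional formant feature, and the 68-point landmark layout are assumptions, not the authors' configuration.

```python
# Minimal sketch of an audio-to-landmark Transformer, under assumed dimensions.
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    def __init__(self, audio_dim=80, formant_dim=4, d_model=128,
                 n_heads=4, n_layers=3, n_landmarks=68):
        super().__init__()
        self.proj = nn.Linear(audio_dim + formant_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, audio_feats, formant_feats):
        # audio_feats: (B, T, audio_dim); formant_feats: (B, T, formant_dim)
        x = self.proj(torch.cat([audio_feats, formant_feats], dim=-1))
        x = self.encoder(x)  # cross-frame context along the time axis
        return self.head(x).view(x.size(0), x.size(1), -1, 2)  # (B, T, 68, 2)

model = AudioToLandmarks()
landmarks = model(torch.randn(2, 25, 80), torch.randn(2, 25, 4))
print(landmarks.shape)  # torch.Size([2, 25, 68, 2])
```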
Funding: supported by the Zhengzhou Collaborative Innovation Major Project under Grant No. 20XTZX06013 and the Henan Provincial Key Scientific Research Project of China under Grant No. 22A520042.
Abstract: Traditional neural radiance fields require dense input images and per-scene optimization to render novel views, which limits their practical applications. We propose a generalization method, SG-NeRF (Sparse-Input Generalized Neural Radiance Fields), that infers scenes from input images and performs high-quality rendering without per-scene optimization. First, we construct an improved multi-view stereo structure based on convolutional attention and a multi-level fusion mechanism to obtain the geometric and appearance features of the scene from the sparse input images; these features are then aggregated by multi-head attention as the input of the neural radiance field. This strategy of using the neural radiance field to decode scene features, instead of mapping positions and orientations, enables cross-scene training as well as inference, allowing the radiance field to generalize to novel view synthesis on unseen scenes. We tested the generalization ability on the DTU dataset: under the same input conditions, our PSNR (peak signal-to-noise ratio) improved by 3.14 dB over the baseline method. In addition, if dense input views of a scene are available, the average PSNR can be improved by a further 1.04 dB through brief refinement training, yielding higher-quality renderings.
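The multi-head attention aggregation step can be pictured as follows: per-sample features gathered from each sparse input view are fused into one conditioning vector for the radiance-field decoder. This minimal PyTorch sketch assumes a learned query token and illustrative dimensions; it is not the authors' implementation.

```python
# Hedged sketch: fuse per-view features for each 3D sample point with
# multi-head attention, producing a single feature for the NeRF decoder.
import torch
import torch.nn as nn

class ViewFeatureAggregator(nn.Module):
    def __init__(self, feat_dim=64, n_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))  # learned query (assumed)
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)

    def forward(self, view_feats):
        # view_feats: (B, N_views, feat_dim) — geometry/appearance features
        # gathered for one 3D sample point from each input view.
        q = self.query.expand(view_feats.size(0), -1, -1)
        fused, _ = self.attn(q, view_feats, view_feats)
        return fused.squeeze(1)  # (B, feat_dim), fed to the radiance-field MLP

agg = ViewFeatureAggregator()
fused = agg(torch.randn(1024, 3, 64))  # 1024 sample points, 3 sparse input views
print(fused.shape)  # torch.Size([1024, 64])
```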
Abstract: Objective: Point-cloud-based neural rendering is sensitive to point cloud quality and feature extraction, which easily degrades the rendering quality of synthesized novel views; we therefore propose a novel view synthesis method that fuses local spatial information. Method: To address deficiencies in point cloud quality and extracted features, we first design a neural point cloud feature alignment module that aligns point cloud features with the features of matched image regions and fuses them into a neural point cloud, strengthening the local expressiveness of its features. Second, we propose a neural point cloud Transformer module that fuses the contextual information of local neural points, so that reliable local spatial information can still be extracted when point cloud quality is poor, effectively improving the synthesis quality of point-based neural rendering. Result: Experiments on real-scene datasets show that on Tanks and Temples, a dataset containing only single objects, our method improves PSNR (peak signal to noise ratio) by 19.2% over NeRF (neural radiance field), and by 6.4% and 3.8% over the point-cloud-input methods Tetra-NeRF and Point-NeRF, respectively; even on the more complex ScanNet dataset, it improves over NeRF and Point-NeRF by 34.6% and 2.1%, respectively. Conclusion: The method makes better use of the local spatial information of point clouds and effectively mitigates the rendering-quality degradation caused by point cloud quality and feature extraction under sparse-view inputs; the experimental results verify its effectiveness.
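The neural point cloud Transformer module described above can be sketched, under assumptions, as self-attention over the features of the k nearest neural points around each query location; attention lets context down-weight unreliable points. The feature size, head count, and k below are illustrative, not taken from the paper.

```python
# Illustrative local-context fusion: self-attention over k-nearest neural points.
import torch
import torch.nn as nn

class LocalPointTransformer(nn.Module):
    def __init__(self, feat_dim=32, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads,
                                           dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, neighbor_feats):
        # neighbor_feats: (B, k, feat_dim) — features of the k nearest neural
        # points; attention lets surrounding context compensate for noisy points.
        fused = self.encoder(neighbor_feats)
        return fused.mean(dim=1)  # (B, feat_dim) local descriptor for rendering

ltf = LocalPointTransformer()
out = ltf(torch.randn(2048, 8, 32))  # 2048 query locations, k = 8 neighbors
print(out.shape)  # torch.Size([2048, 32])
```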
Funding: supported in part by the Plant Science Institute at Iowa State University, the National Science Foundation under grant number OAC:1750865, and the National Institute of Food and Agriculture (USDA-NIFA) as part of the AI Institute for Resilient Agriculture (AIIRA), grant number 2021-67021-35329.
Abstract: We evaluate different Neural Radiance Field (NeRF) techniques for the 3D reconstruction of plants in varied environments, from indoor settings to outdoor fields. Traditional methods usually fail to capture the complex geometric details of plants, which is crucial for phenotyping and breeding studies. We evaluate the reconstruction fidelity of NeRFs in three scenarios of increasing complexity and compare the results against a point cloud obtained by light detection and ranging as ground truth. In the most realistic field scenario, the NeRF models achieve a 74.6% F1 score after 30 min of training on a graphics processing unit, highlighting the efficacy of NeRFs for 3D reconstruction in challenging environments. Additionally, we propose an early stopping technique for NeRF training that almost halves the training time while reducing the average F1 score by only 7.4%. This optimization substantially enhances the speed and efficiency of 3D reconstruction using NeRFs. Our findings demonstrate the potential of NeRFs for detailed and realistic 3D plant reconstruction and suggest practical approaches for making NeRF-based 3D reconstruction faster and more efficient.
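The abstract does not detail its early-stopping criterion; a common plateau rule (stop when validation PSNR has not improved by some margin for several successive checks) is sketched below as one possible realization, with patience and margin values chosen purely for illustration.

```python
# Hedged sketch of an early-stopping rule for NeRF training. The paper's exact
# criterion is not given; this PSNR-plateau rule is an assumption.
class EarlyStopper:
    def __init__(self, patience=5, min_delta=0.1):
        self.patience = patience        # allowed checks without improvement
        self.min_delta = min_delta      # minimum PSNR gain that counts
        self.best_psnr = float("-inf")
        self.stale_checks = 0

    def should_stop(self, val_psnr):
        if val_psnr > self.best_psnr + self.min_delta:
            self.best_psnr = val_psnr   # meaningful improvement: reset counter
            self.stale_checks = 0
        else:
            self.stale_checks += 1      # another check without improvement
        return self.stale_checks >= self.patience

# Usage inside a training loop (evaluate PSNR every few hundred iterations):
stopper = EarlyStopper(patience=5, min_delta=0.1)
for psnr in [18.2, 21.5, 23.0, 23.4, 23.4, 23.4, 23.4, 23.4, 23.4, 23.4]:
    if stopper.should_stop(psnr):
        print(f"stopping early at PSNR {psnr:.1f}")
        break
```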