This paper presents a novel approach for camera pose refinement based on neural radiance fields (NeRF) by introducing semantic feature consistency to enhance robustness. NeRF has been successfully applied to camera pose estimation by inverting the rendering process given an observed RGB image and an initial pose estimate. However, previous methods adopted only photometric consistency for pose optimization, which is prone to becoming trapped in local minima. To address this problem, we introduce semantic feature consistency into the existing framework. Specifically, we utilize high-level features extracted from a convolutional neural network (CNN) pre-trained for image recognition, and maintain consistency of such features between the observed and rendered images during the optimization procedure. Unlike the color values at individual pixels, these features contain rich semantic information shared within local regions and are more robust to appearance changes across viewpoints. Since rendering a full image with NeRF for CNN feature extraction is computationally expensive, we propose an efficient way to estimate the features of individually rendered pixels by projecting them into a nearby reference image and interpolating its feature maps. Extensive experiments show that our method greatly outperforms the baseline method on both synthetic objects and real-world large indoor scenes, increasing the accuracy of pose estimation by over 6.4%.
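The per-pixel feature lookup described above — projecting a rendered pixel's 3D point into a nearby reference view and bilinearly interpolating that view's CNN feature map — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the pinhole intrinsics `K`, and the world-to-camera transform `T_ref` are assumptions for the sketch.

```python
import numpy as np

def project_points(points_world, K, T_ref):
    """Project 3D world points into a reference camera.

    K     : (3, 3) pinhole intrinsics (assumed for this sketch).
    T_ref : (4, 4) world-to-camera transform of the reference view.
    Returns (N, 2) sub-pixel (u, v) coordinates.
    """
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (T_ref @ pts_h.T).T[:, :3]      # world frame -> camera frame
    uv_h = (K @ pts_cam.T).T                  # camera frame -> homogeneous pixels
    return uv_h[:, :2] / uv_h[:, 2:3]         # perspective divide

def interpolate_features(feat_map, uv):
    """Bilinearly interpolate an (H, W, C) feature map at sub-pixel uv coords."""
    H, W, _ = feat_map.shape
    u = np.clip(uv[:, 0], 0.0, W - 1 - 1e-6)  # clamp to keep u0+1, v0+1 in bounds
    v = np.clip(uv[:, 1], 0.0, H - 1 - 1e-6)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    f00 = feat_map[v0, u0]                    # four neighboring feature vectors
    f01 = feat_map[v0, u0 + 1]
    f10 = feat_map[v0 + 1, u0]
    f11 = feat_map[v0 + 1, u0 + 1]
    return (f00 * (1 - du) * (1 - dv) + f01 * du * (1 - dv)
            + f10 * (1 - du) * dv + f11 * du * dv)

# Toy example: identity reference pose, 100x100 feature map whose single
# channel equals the column index u, so interpolation is easy to verify.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
T_ref = np.eye(4)
feat_map = np.tile(np.arange(100.0)[None, :, None], (100, 1, 1))

uv = project_points(np.array([[0.0, 0.0, 1.0]]), K, T_ref)
feats = interpolate_features(feat_map, uv)    # feature at projected pixel (50, 50)
```

In the full pipeline, `feats` would stand in for the feature of the NeRF-rendered pixel, and the optimization would minimize its distance to the feature extracted at the corresponding location in the observed image.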