Funding: Project (No. 2002CB312101) supported by the National Basic Research Program (973) of China.
Abstract: This paper presents techniques for synthesizing a novel view for a virtual viewpoint from two given views captured at different viewpoints, achieving both high quality and high efficiency. The whole process consists of three passes. The first pass recovers the depth map. We formulate depth recovery as a pixel-labelling problem and propose a bisection approach to solve it; the search finishes in log2(n) steps (n is the number of depth levels), each of which involves a single graph-cut computation. The second pass detects occluded pixels and reasons about their depth. It fits a foreground depth curve and a background depth curve using the depths of nearby foreground and background pixels, and then distinguishes foreground from background pixels by minimizing a global energy, which involves only one graph-cut computation. The third pass finds, for each pixel in the novel view, the corresponding pixels in the input views and computes its color. The whole process involves only a small number of graph-cut computations and is therefore efficient. Moreover, visual artifacts in the synthesized view can be removed successfully by correcting the depth of the occluded pixels. Experimental results demonstrate that the proposed techniques achieve both high quality and high efficiency.
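To make the bisection idea concrete, the sketch below runs a per-pixel binary search over a matching-cost volume. Each halving step stands in for one of the paper's graph-cut computations (here replaced by a per-pixel cost comparison, which ignores the smoothness term); the function names and toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def bisection_labels(cost_volume):
    """Per-pixel binary search over depth labels.

    cost_volume: (n, H, W) matching costs (lower = better). Each
    halving step stands in for one graph-cut computation in the
    paper, so the full search takes about log2(n) steps.
    """
    n, h, w = cost_volume.shape
    labels = np.zeros((h, w), dtype=int)
    for y in range(h):
        for x in range(w):
            lo, hi = 0, n - 1
            while lo < hi:
                mid = (lo + hi) // 2
                # keep the half-interval containing the cheaper label
                if cost_volume[lo:mid + 1, y, x].min() <= \
                   cost_volume[mid + 1:hi + 1, y, x].min():
                    hi = mid
                else:
                    lo = mid + 1
            labels[y, x] = lo
    return labels

# toy example: 8 depth levels on a 2x2 image
rng = np.random.default_rng(0)
costs = rng.random((8, 2, 2))
print(bisection_labels(costs))
print(costs.argmin(axis=0))  # per-pixel optimum, for comparison
```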
Abstract: A new method is proposed for synthesizing intermediate views from a pair of stereoscopic images. In order to synthesize high-quality intermediate views, block matching together with a simplified multi-window technique and dynamic programming is used for disparity estimation. Occlusion detection is then performed to locate occluded regions, and their disparities are compensated. After projecting the left-to-right and right-to-left disparities onto the intermediate image, the intermediate view is synthesized with the occluded regions taken into account. Experimental results show that the proposed synthesis method obtains intermediate views of higher quality.
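As a minimal illustration of projecting disparities onto an intermediate image, the sketch below forward-warps the left view to a virtual viewpoint at fraction alpha of the baseline. The disparity sign convention and the toy arrays are assumptions, and the unfilled pixels it reports are exactly the hole/occlusion candidates the paper must compensate.

```python
import numpy as np

def warp_to_intermediate(left, disp_lr, alpha=0.5):
    """Forward-warp the left view toward the right by alpha of the
    baseline (alpha = 0.5 gives the midpoint view). Assumes positive
    disparity shifts pixels leftward in the target image."""
    h, w = left.shape
    out = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xt = x - int(round(alpha * disp_lr[y, x]))
            if 0 <= xt < w:
                out[y, xt] = left[y, x]
                filled[y, xt] = True
    holes = ~filled          # unfilled pixels: occlusion candidates
    return out, holes

left = np.arange(12.0).reshape(3, 4)
disp = np.full((3, 4), 2.0)          # toy constant disparity
mid, holes = warp_to_intermediate(left, disp)
print(mid)
print(holes)
```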
Abstract: A new view synthesis method is proposed based on Delaunay triangulation. First, a Delaunay triangulation of the two reference images is built. Second, image points are matched under the epipolar geometry constraint. Finally, the third view is constructed by transferring pixels under the trilinear constraint. The method dispenses with the classical, time-consuming dense matching step and takes advantage of the Delaunay triangulation, so it not only saves computation time but also enhances the quality of the synthesized view. The significance of this method is that it can be used directly in video coding, image compression and virtual reality.
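The sketch below shows the triangulation half of such a pipeline with SciPy: sparse matched points are triangulated once in the first view, and any interior point is transferred via barycentric coordinates. This piecewise-affine transfer is only a stand-in for the paper's trilinear-constraint transfer, and the point data are made up for illustration.

```python
import numpy as np
from scipy.spatial import Delaunay

# Hypothetical sparse matches between the two reference views; the
# paper obtains such matches under the epipolar geometry constraint.
pts1 = np.array([[0., 0.], [100., 0.], [0., 100.], [100., 100.], [50., 40.]])
pts2 = pts1 + np.array([3.0, 0.5])   # toy shift standing in for view 2

tri = Delaunay(pts1)                 # triangulate view 1 once

def transfer(p):
    """Map a view-1 point into view 2 with barycentric coordinates in
    its Delaunay triangle (piecewise-affine approximation)."""
    s = int(tri.find_simplex(np.asarray(p)))
    if s < 0:
        return None                  # outside the triangulated region
    T = tri.transform[s]
    b = T[:2].dot(np.asarray(p) - T[2])
    bary = np.append(b, 1.0 - b.sum())
    return bary.dot(pts2[tri.simplices[s]])

print(transfer([20.0, 30.0]))        # transferred point in view 2
```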
Abstract: For serial images pre-acquired from lengthways camera motion, a view synthesis algorithm based on the epipolar geometry constraint is proposed in this paper. It exploits the global matching and ordering properties of epipolar lines, together with Fourier-transform and dynamic-programming matching theory, to synthesize the destination image at the current viewpoint faithfully. Through this combination of the Fourier transform, the epipolar geometry constraint and dynamic-programming matching, the circumference distortion produced by conventional view synthesis approaches is effectively avoided. The detailed implementation steps of the algorithm are given, and running instances are presented to illustrate the results.
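A common core of such methods is ordering-constrained dynamic-programming matching along corresponding epipolar lines. The sketch below matches two 1-D scanlines with an occlusion penalty, under made-up costs; it is a generic stand-in for the paper's matcher, not its exact formulation.

```python
import numpy as np

def dp_match(line_a, line_b, occ=0.8):
    """Order-preserving match of two scanlines by dynamic programming.
    D[i, j] = best cost of matching prefixes a[:i] and b[:j]; moving
    diagonally matches a pixel pair, moving along one axis skips
    (occludes) a pixel at penalty `occ`."""
    n, m = len(line_a), len(line_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = occ * np.arange(m + 1)
    D[:, 0] = occ * np.arange(n + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = abs(line_a[i - 1] - line_b[j - 1])
            D[i, j] = min(D[i - 1, j - 1] + match,
                          D[i - 1, j] + occ,   # pixel of a occluded
                          D[i, j - 1] + occ)   # pixel of b occluded
    return D[n, m]

a = np.array([0.1, 0.5, 0.9, 0.9, 0.2])
b = np.array([0.1, 0.9, 0.9, 0.2, 0.2])
print(dp_match(a, b))   # total matching cost of the best alignment
```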
Funding: Supported by the National Natural Science Foundation of China (Grant No. 60832003), the Key Laboratory of Advanced Display and System Application (Shanghai University), Ministry of Education, China (Grant No. P200902), and the Key Project of the Science and Technology Commission of Shanghai Municipality (Grant No. 10510500500).
Abstract: Depth maps are used to synthesize virtual views in free-viewpoint television (FTV) systems. When depth maps are derived using existing depth estimation methods, depth distortions cause undesirable artifacts in the synthesized views. To solve this problem, a depth-map-based 3D video quality model (D-3DV) for virtual view synthesis and depth map coding in FTV applications is proposed. First, the relationships between distortions in the coded depth map and in the rendered view are derived. Then, a precise 3DV quality model based on depth characteristics is developed for the synthesized virtual views. Finally, based on the D-3DV model, multilateral filtering is applied as a pre-processing filter to reduce rendering artifacts. Experimental results evaluated by objective and subjective methods indicate that the proposed D-3DV model can reduce the bit-rate of depth coding and achieve better rendering quality.
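As an illustration of a "multilateral" depth pre-filter, the sketch below extends a bilateral filter on the depth map with a third weight from the co-located color image. The kernel shapes and parameters are assumptions rather than the paper's D-3DV-derived weights.

```python
import numpy as np

def multilateral_filter(depth, color, radius=2,
                        sigma_s=2.0, sigma_d=4.0, sigma_c=10.0):
    """Smooth `depth` with weights from spatial distance, depth
    similarity, and similarity in the aligned grayscale `color`
    image: three 'lateral' terms, hence multilateral."""
    h, w = depth.shape
    out = np.zeros_like(depth, dtype=float)
    for y in range(h):
        for x in range(w):
            acc = wsum = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue
                    ws = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
                    wd = np.exp(-(depth[yy, xx] - depth[y, x]) ** 2 / (2 * sigma_d ** 2))
                    wc = np.exp(-(color[yy, xx] - color[y, x]) ** 2 / (2 * sigma_c ** 2))
                    wgt = ws * wd * wc
                    acc += wgt * depth[yy, xx]
                    wsum += wgt
            out[y, x] = acc / wsum
    return out

rng = np.random.default_rng(0)
d = rng.integers(0, 100, (6, 6)).astype(float)
c = d + rng.normal(0, 3, (6, 6))     # toy aligned color/intensity
print(multilateral_filter(d, c).round(1))
```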
Funding: Supported by the National Natural Science Foundation of China (61075013).
Abstract: View synthesis is an important building block in three-dimensional (3D) video processing and communications. Based on one or several views, view synthesis creates other views for the purpose of view prediction (for compression) or view rendering (for multiview displays). The quality of view synthesis depends on how the occlusion area is filled as well as how the pixels are created; consequently, luminance adjustment and hole filling are two key issues. In this paper, two views are used to produce an arbitrary virtual synthesized view. One view is merged into the other using a local luminance adjustment method, in which the adjustment coefficient is computed over a local neighborhood region. Moreover, a maximum-neighborhood-spreading-strength hole filling method is presented to handle micro-texture structure while the hole is being filled. For each pixel on the hole boundary, the neighborhood pixels along the maximum-spreading-strength direction are selected as candidates, and among them the pixel with the maximum spreading strength is used to fill the hole from boundary to center. If disocclusion pixels remain after one scan, the filling process is repeated until all hole pixels are filled. Simulation results show that the proposed method is efficient and robust, and achieves high performance both subjectively and objectively.
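One simple form of local luminance adjustment is sketched below: each pixel of the view being merged is scaled by the ratio of local mean luminances against the other view. The window size, gain clamp, and grayscale input are assumptions, not the paper's exact coefficient.

```python
import numpy as np

def local_luminance_adjust(src, ref, radius=4, eps=1e-6):
    """Scale `src` so its local mean luminance matches `ref` where the
    two views overlap. A box window of side 2*radius+1 is used; the
    per-pixel gain is the ratio of local means, clamped for safety."""
    h, w = src.shape
    src_p = np.pad(src, radius, mode='edge')
    ref_p = np.pad(ref, radius, mode='edge')
    out = np.empty_like(src, dtype=float)
    for y in range(h):
        for x in range(w):
            win = (slice(y, y + 2 * radius + 1),
                   slice(x, x + 2 * radius + 1))
            gain = (ref_p[win].mean() + eps) / (src_p[win].mean() + eps)
            out[y, x] = src[y, x] * np.clip(gain, 0.5, 2.0)
    return out

a = np.full((8, 8), 100.0)
b = np.full((8, 8), 120.0)                  # reference is brighter
print(local_luminance_adjust(a, b)[0, 0])   # -> 120.0
```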
Funding: Supported by the Zhengzhou Collaborative Innovation Major Project under Grant No. 20XTZX06013 and the Henan Provincial Key Scientific Research Project of China under Grant No. 22A520042.
Abstract: Traditional neural radiance fields for rendering novel views require dense input images and per-scene optimization, which limits their practical applications. We propose a generalization method, SG-NeRF (Sparse-Input Generalized Neural Radiance Fields), that infers scenes from input images and performs high-quality rendering without per-scene optimization. First, we construct an improved multi-view stereo structure based on convolutional attention and a multi-level fusion mechanism to obtain the geometric and appearance features of the scene from the sparse input images; these features are then aggregated by multi-head attention as the input to the neural radiance fields. This strategy of using neural radiance fields to decode scene features, instead of mapping positions and orientations, enables cross-scene training and inference, and thus allows neural radiance fields to generalize to novel view synthesis on unseen scenes. We tested the generalization ability on the DTU dataset: under the same input conditions, our PSNR (peak signal-to-noise ratio) improved by 3.14 over the baseline method. In addition, if dense input views are available for a scene, the average PSNR can be improved by a further 1.04 through brief refinement training, yielding higher-quality rendering.
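The aggregation step can be pictured as follows: per-view features for each sample point are fused into one conditioning feature by multi-head attention. The sketch below uses PyTorch's stock attention module with made-up shapes and a mean-pooled query; it illustrates the idea only and is not SG-NeRF's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: aggregate per-view features for R sample points
# from V source views into one feature per point, as conditioning for
# a NeRF-style decoder.
R, V, C = 1024, 3, 32                  # points, views, channels
per_view_feats = torch.randn(R, V, C)  # stand-in MVS features

attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
query = per_view_feats.mean(dim=1, keepdim=True)   # (R, 1, C) pooled query
agg, weights = attn(query, per_view_feats, per_view_feats)
print(agg.shape)       # torch.Size([1024, 1, 32]): one feature per point
```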
基金supported by the National Natural Science Foundation of China under Grants 62125102the Beijing Natural Science Foundation under Grant JL23005the Fundamental Research Funds for the Central Universities
Abstract: Novel view synthesis (NVS) is an important task for 3D interpretation in remote sensing scenes, and it also benefits vicinagearth security by enhancing situational awareness capabilities. Recently, NVS methods based on neural radiance fields (NeRFs) have attracted increasing attention for their self-supervised training and highly photo-realistic synthesis results. However, synthesizing novel view images in remote sensing scenes remains challenging, given the complexity of land covers and the sparsity of input multi-view images. In this paper, we propose a novel NVS method named FReSNeRF, which combines image-based rendering (IBR) and NeRF to achieve high-quality results in remote sensing scenes with sparse input. We effectively address the degradation problem by adopting a sampling-space annealing method. Additionally, we introduce a depth smoothness term based on the segmentation mask to constrain the scene geometry. Experiments on multiple scenes show the superiority of the proposed FReSNeRF over other methods.
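Sampling-space annealing, as used to stabilize sparse-input NeRFs in general, starts sampling in a narrow band around the scene center and widens it toward the full near/far range as training proceeds. The schedule below is a generic sketch of that idea with assumed parameters, not the paper's exact schedule.

```python
import numpy as np

def annealed_bounds(near, far, step, total, psi=0.5):
    """Linearly widen the sampling interval from a band around the
    scene center out to the full [near, far] range as training
    proceeds. `psi` sets the initial fraction of the full range."""
    mid = 0.5 * (near + far)
    t = min(1.0, psi + step / float(total))
    return mid + (near - mid) * t, mid + (far - mid) * t

for s in [0, 2500, 5000]:
    print(annealed_bounds(2.0, 6.0, s, 5000))
# (3.0, 5.0) -> (2.0, 6.0) -> (2.0, 6.0)
```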
Funding: Supported in part by the National Natural Science Foundation of China under Grant No. 62473013 and the Key Project of Science and Technology Innovation and Entrepreneurship of TDTEC (No. 2022-TDZD004).
Abstract: The newly emerging neural radiance fields (NeRF) methods implicitly fulfill three-dimensional (3D) reconstruction by training a neural network to render novel-view images of a given scene from given posed images. The Instant Neural Graphics Primitives (Instant-NGP) method further improves the position encoding of NeRF and obtains state-of-the-art efficiency. However, only a local pixel-wise loss is considered when training Instant-NGP, overlooking the nonlocal structural information between pixels. Despite good quantitative results, this leads to a poor visual effect, especially in completeness. Inspired by the stochastic structural similarity (S3IM) method, which exploits nonlocal structural information of groups of pixels, this paper proposes a new method to improve the completeness of fast novel view synthesis. The proposed method first extends the thread-wise processing of Instant-NGP to processing in a custom thread block (i.e., a group of threads). Then, the relative dimensionless global error in synthesis (Erreur Relative Globale Adimensionnelle de Synthèse, ERGAS) of the group of pixels corresponding to a group of threads is computed and incorporated into the loss function. Extensive experiments validate the proposed method: it obtains better quantitative results than the original Instant-NGP with fewer iteration steps, increasing PSNR by 1%. Remarkable qualitative results are obtained, especially for delicate structures and details such as lines and continuous structures. With these improvements in visual quality, our method can boost the practicability of implicit 3D reconstruction in applications such as self-driving and augmented reality.
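The ERGAS measure named above has a standard closed form; the sketch below computes it for a group of pixels, treating the color channels as bands and assuming a resolution ratio of 1, which is an assumption for this loss-style use rather than the paper's exact implementation.

```python
import numpy as np

def ergas(pred, ref, ratio=1.0):
    """Relative dimensionless global error in synthesis (ERGAS)
    between two pixel groups of shape (N, K) with K bands/channels:
        ERGAS = 100 * ratio * sqrt(mean_k(RMSE_k^2 / mu_k^2))
    where mu_k is the mean of reference band k; ratio = 1 is assumed
    when both groups share one resolution."""
    rmse2 = np.mean((pred - ref) ** 2, axis=0)        # per-band MSE
    mu = np.mean(ref, axis=0)
    return 100.0 * ratio * np.sqrt(np.mean(rmse2 / (mu ** 2 + 1e-12)))

rng = np.random.default_rng(1)
ref = rng.random((256, 3))             # a group of 256 RGB pixels
pred = ref + 0.01 * rng.standard_normal((256, 3))
print(ergas(pred, ref))
```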
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 62072284 and 61772318, and the Special Project of the Science and Technology Innovation Base of the Key Laboratory of Shandong Province for Software Engineering under Grant No. 11480004042015.
Abstract: Indoor visual localization, i.e., 6-degree-of-freedom camera pose estimation for a query image with respect to a known scene, is gaining increased attention, driven by the rapid progress of applications such as robotics and augmented reality. However, drastic visual discrepancies between an onsite query image and prerecorded indoor images pose a significant challenge for visual localization. In this paper, based on the key observation that planar surfaces such as floors or walls constantly exist in indoor scenes, we propose a novel system that incorporates geometric information to address the issues of relying on pixel information alone. Through the system implementation, we contribute a hierarchical structure consisting of pre-scanned images and a point cloud, as well as a distilled representation of the planar-element layout extracted from the original dataset. A view synthesis procedure is designed to generate synthetic images complementary to the sparsely sampled dataset. Moreover, a global image descriptor based on image statistics, called block mean, variance, and color (BMVC), is employed to speed up candidate pose identification in combination with a traditional convolutional neural network (CNN) descriptor. Experimental results on a popular benchmark demonstrate that the proposed method outperforms state-of-the-art approaches in terms of visual localization validity and accuracy.
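A plausible reading of the BMVC descriptor is sketched below: the image is split into a grid of blocks and each block contributes its gray-level mean and variance plus its mean color. The grid size and channel layout are assumptions, since the abstract does not fix them.

```python
import numpy as np

def bmvc_descriptor(img, grid=4):
    """Block mean, variance, and color descriptor (sketch).
    img: (H, W, 3) RGB array. The image is divided into grid x grid
    blocks; per block we keep the gray-level mean and variance plus
    the mean of each color channel, i.e., 5 numbers per block."""
    h, w, _ = img.shape
    gray = img.mean(axis=2)
    feats = []
    for by in range(grid):
        for bx in range(grid):
            ys = slice(by * h // grid, (by + 1) * h // grid)
            xs = slice(bx * w // grid, (bx + 1) * w // grid)
            block, gblock = img[ys, xs], gray[ys, xs]
            feats += [gblock.mean(), gblock.var(),
                      *block.reshape(-1, 3).mean(axis=0)]
    return np.asarray(feats)

img = np.random.default_rng(0).random((64, 64, 3))
print(bmvc_descriptor(img).shape)    # (4*4*5,) = (80,)
```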
Funding: This work was supported in part by the National Natural Science Foundation of China (62171317 and 62122058).
Abstract: Novel viewpoint image synthesis is very challenging, especially from sparse views, due to large viewpoint changes and occlusion. Existing image-based methods fail to generate reasonable results for invisible regions, while geometry-based methods have difficulty synthesizing detailed textures. In this paper, we propose STATE, an end-to-end deep neural network for sparse view synthesis that learns structure and texture representations. Structure is encoded as a hybrid feature field to predict reasonable structures for invisible regions while maintaining original structures for visible regions, and texture is encoded as a deformed feature map to preserve detailed textures. We propose a hierarchical fusion scheme with intra-branch and inter-branch aggregation, in which spatio-view attention performs multi-view fusion at the feature level, adaptively selecting important information by regressing pixel-wise or voxel-wise confidence maps. By decoding the aggregated features, STATE is able to generate realistic images with reasonable structures and detailed textures. Experimental results demonstrate that our method achieves qualitatively and quantitatively better results than state-of-the-art methods. Benefiting from the implicit disentanglement of structure and texture, our method also enables texture and structure editing applications. Our code is available at http://cic.tju.edu.cn/faculty/likun/projects/STATE.
Funding: Supported by the National Key Foundation for Exploring Scientific Instrument (2013YQ140517), the National Natural Science Foundation of China (Grant No. 61522111), and the Shenzhen Peacock Plan (KQTD20140630115140843).
Abstract: Typical stereo algorithms treat disparity estimation and view synthesis as two sequential procedures. In this paper, we consider stereo matching and view synthesis as two complementary components and present a novel iterative refinement model for joint view synthesis and disparity refinement. To achieve mutual promotion between view synthesis and disparity refinement, we apply two key strategies: disparity-map fusion and disparity-assisted plane-sweep-based rendering (DAPSR). On the one hand, the disparity-map fusion strategy generates a disparity map from the synthesized view and the input views, and is able to detect and counteract disparity errors caused by potential artifacts in the synthesized view. On the other hand, DAPSR is used for view synthesis and updating, and is able to weaken interpolation errors caused by outliers in the disparity maps. Experiments on the Middlebury benchmarks demonstrate that, by introducing the synthesized view, disparity errors due to large occluded regions and large baselines are effectively eliminated and the synthesis quality is greatly improved.
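Plane-sweep rendering, which DAPSR builds on, can be sketched as follows: the source view is shifted to the target viewpoint at each candidate disparity, and each target pixel keeps the color from the disparity with the best photo-consistency. The cost function and toy data here are illustrative assumptions.

```python
import numpy as np

def plane_sweep(src, ref, max_disp=8):
    """For each pixel of `ref`, test horizontal shifts (disparity
    planes) of `src` and keep the color whose shifted value best
    matches `ref`: a minimal plane-sweep renderer on rectified
    grayscale views."""
    h, w = ref.shape
    best_cost = np.full((h, w), np.inf)
    out = np.zeros((h, w))
    for d in range(max_disp + 1):
        shifted = np.full((h, w), np.nan)
        if d < w:
            shifted[:, d:] = src[:, :w - d]   # shift right by d pixels
        cost = np.abs(shifted - ref)
        take = cost < best_cost               # NaN borders compare False
        best_cost[take] = cost[take]
        out[take] = shifted[take]
    return out

rng = np.random.default_rng(2)
src = rng.random((4, 16))
ref = np.roll(src, 3, axis=1)                 # toy view shifted by 3 px
print(np.allclose(plane_sweep(src, ref)[:, 3:], ref[:, 3:]))  # True
```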
Abstract: In recent years, the concept of the digital human has attracted widespread attention from all walks of life, and the modelling of high-fidelity human bodies, heads, and hands has been intensively studied. This paper focuses on head modelling and proposes a generic parametric head model based on neural radiance fields. Specifically, we first use face recognition networks and the 3D facial expression database FaceWarehouse to parameterize identity and expression semantics, respectively, and use both as conditional inputs to build a neural radiance field for the human head, thereby improving the head model's representation ability while retaining editing capabilities for the identity and expression of the rendered results. Then, through a combination of volume rendering and neural rendering, the 3D representation of the head is rapidly rendered onto the 2D plane, producing a high-fidelity image of the human head. Thanks to well-designed loss functions and the good implicit representation of the neural radiance field, our model can not only edit identity and expression independently, but also freely modify the virtual camera position of the rendered results. It has excellent multi-view consistency and many applications, such as novel view synthesis and pose driving.
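The volume rendering step mentioned above follows the standard NeRF compositing rule: per-sample densities along a ray become alpha values, and colors are accumulated with transmittance weights. The sketch below implements that textbook formula on made-up samples.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Standard volume rendering quadrature:
        alpha_i = 1 - exp(-sigma_i * delta_i)
        T_i     = prod_{j<i} (1 - alpha_j)        (transmittance)
        C       = sum_i T_i * alpha_i * color_i
    sigmas: (S,) densities, colors: (S, 3), deltas: (S,) spacings."""
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0), weights

sigmas = np.array([0.0, 0.5, 3.0, 0.2])
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], float)
deltas = np.full(4, 0.25)
rgb, w = composite_ray(sigmas, colors, deltas)
print(rgb, w.sum())     # composited color and accumulated opacity
```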
Funding: Supported by the Science and Technology Department Major Innovation Special Fund of Hubei Province of China (2020BAB116).
Abstract: With the popularity of the digital human body, monocular three-dimensional (3D) face reconstruction is widely used in fields such as animation and face recognition. Although current methods trained on single-view image sets perform well in monocular 3D face reconstruction tasks, they tend to rely on the constraints of a prior model or on the appearance conditions of the input images, fundamentally because of the lack of an effective way to reduce the effects of two-dimensional (2D) ambiguity. To solve this problem, we developed an unsupervised training framework for monocular 3D face reconstruction using rotational cycle consistency. Specifically, to learn more accurate facial information, we first use an autoencoder to factor the input images and apply these factors to generate normalized frontal views. We then pass them through a differentiable renderer and use rotational consistency for continuous perceptual refinement. Our method provides implicit multi-view consistency constraints on the pose and depth estimation of the input face, and its performance is accurate and robust in the presence of large variations in expression and pose. In benchmark tests, our method performed more stably and realistically than other methods for 3D face reconstruction from monocular 2D images.
Abstract: Background: In this study, we propose view interpolation networks to reproduce changes in the brightness of an object's surface depending on the viewing direction, which is important for reproducing the material appearance of a real object. Method: We used an original and a modified version of U-Net for image transformation. The networks were trained to generate images from intermediate viewpoints of four cameras placed at the corners of a square. We conducted an experiment with three different combinations of methods and training-data formats. Result: We determined that inputting the coordinates of the viewpoints together with the four camera images, and using images from random viewpoints as the training data, produces the best results.
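Feeding the target viewpoint to an image-to-image network is commonly done by broadcasting the coordinates into constant feature channels. The sketch below shows this input packing for four corner images; all shapes and the packing scheme are assumptions, not the authors' exact format.

```python
import numpy as np

def pack_inputs(corner_imgs, viewpoint, hw=(64, 64)):
    """Stack four (H, W, 3) corner-camera images channel-wise and
    append the 2-D target viewpoint as two constant channels,
    giving one (H, W, 14) tensor for a U-Net-style network."""
    h, w = hw
    stacked = np.concatenate(corner_imgs, axis=2)        # (H, W, 12)
    coords = np.ones((h, w, 2)) * np.asarray(viewpoint)  # broadcast u, v
    return np.concatenate([stacked, coords], axis=2)

imgs = [np.zeros((64, 64, 3)) for _ in range(4)]
print(pack_inputs(imgs, (0.25, 0.75)).shape)   # (64, 64, 14)
```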
Funding: Supported by the National Natural Science Foundation of China (Nos. 62172315, 62073262, and 61672429), the Fundamental Research Funds for the Central Universities, the Innovation Fund of Xidian University (No. 20109205456), the Key Research and Development Program of Shaanxi (No. S2021-YF-ZDCXL-ZDLGY-0127), and HUAWEI.
Abstract: Free-viewpoint video allows the user to view objects from any virtual perspective, creating an immersive visual experience. This technology enhances the interactivity and freedom of multimedia performances. However, many free-viewpoint video synthesis methods can hardly work in real time with high precision, particularly for sports fields with large areas and numerous moving objects. To address these issues, we propose a free-viewpoint video synthesis method based on distance-field acceleration. The central idea is to fuse multiview distance-field information and use it to adjust the search step size adaptively. The adaptive step-size search is used in two ways: for fast estimation of multi-object three-dimensional surfaces, and for synthetic view rendering based on global occlusion judgement. We implemented our ideas with parallel computing for interactive display, using the CUDA and OpenGL frameworks, and evaluated them on real-world and simulated datasets. The results show that the proposed method can render free-viewpoint videos with multiple objects on large sports fields at 25 fps. Furthermore, the visual quality of our synthesized novel viewpoint images exceeds that of state-of-the-art neural-rendering-based methods.
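One classic form of adaptive step-size search over a distance field is sphere tracing: the ray can safely advance by the distance-field value at the current point, taking large steps in empty space and small steps near surfaces. A minimal single-ray sketch against an analytic sphere SDF follows; the scene and tolerances are made up.

```python
import numpy as np

def sphere_trace(origin, direction, sdf, t_max=20.0, eps=1e-4):
    """March a ray through a signed distance field, stepping by the
    field value itself. Returns the hit distance or None."""
    t = 0.0
    while t < t_max:
        d = sdf(origin + t * direction)
        if d < eps:
            return t
        t += d           # adaptive step: safe by the SDF property
    return None

sphere_sdf = lambda p: np.linalg.norm(p - np.array([0., 0., 5.])) - 1.0
hit = sphere_trace(np.zeros(3), np.array([0., 0., 1.]), sphere_sdf)
print(hit)               # ~4.0: front surface of the unit sphere at z=5
```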
Funding: Partially supported by Innoviris (3-DLicornea project) and FWO (project G.0256.15), and supported by the National Natural Science Foundation of China (Nos. 61272226 and 61373069), a Research Grant of the Beijing Higher Institution Engineering Research Center, the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology, and the Tsinghua University Initiative Scientific Research Program.
Abstract: Multiview video can provide more immersive perception than traditional single 2-D video. It enables both interactive free-navigation applications and high-end autostereoscopic displays on which multiple users can perceive genuine 3-D content without glasses. The multiview format also comprises much more visual information than classical 2-D or stereo 3-D content, which makes it possible to perform various interesting editing operations at both the pixel level and the object level. This survey provides a comprehensive review of existing multiview video synthesis and editing algorithms and applications. For each topic, the related technologies in classical 2-D image and video processing are reviewed first. We then discuss recent advanced techniques for multiview video virtual view synthesis and various interactive editing applications. Given the ongoing progress in multiview video synthesis and editing, we foresee that more and more immersive 3-D video applications will appear in the future.
Funding: Supported by the National Natural Science Foundation of China (61462048).
Abstract: Existing depth video coding algorithms are generally based on in-loop depth filters, whose performance is unstable and easily affected by outliers. In this paper, we design a joint weighted-sparse-representation-based median filter as the in-loop filter in a depth video codec. It constructs a depth candidate set containing relevant neighboring depth pixels through sparse coding weighted by depth and intensity similarity, and then performs the median operation on this set to select a neighboring depth pixel as the filtering result. The experimental results indicate that the depth bitrate is reduced by about 9% compared with the anchor method, confirming that the proposed method is more effective at reducing the depth bitrate required for a given synthesis quality level.
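A simplified version of such a filter is sketched below: neighbors are scored by joint depth and intensity similarity, the highest-scoring ones form the candidate set, and their median replaces the center pixel. The window, weights, and candidate count are assumptions, standing in for the paper's weighted sparse codes.

```python
import numpy as np

def similarity_weighted_median(depth, intensity, y, x,
                               radius=2, k=5,
                               sigma_d=4.0, sigma_i=10.0):
    """Filter depth[y, x]: rank neighbors by joint depth/intensity
    similarity to the center, keep the k most similar as the
    candidate set, and return their median depth."""
    cands, scores = [], []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy < depth.shape[0] and 0 <= xx < depth.shape[1]:
                wd = np.exp(-(depth[yy, xx] - depth[y, x]) ** 2
                            / (2 * sigma_d ** 2))
                wi = np.exp(-(intensity[yy, xx] - intensity[y, x]) ** 2
                            / (2 * sigma_i ** 2))
                cands.append(depth[yy, xx])
                scores.append(wd * wi)
    order = np.argsort(scores)[::-1][:k]   # k most similar neighbors
    return float(np.median(np.asarray(cands)[order]))

rng = np.random.default_rng(3)
d = rng.integers(0, 255, (8, 8)).astype(float)
i = d + rng.normal(0, 2, (8, 8))           # toy aligned intensity
print(similarity_weighted_median(d, i, 4, 4))
```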
Funding: Supported by the National Natural Science Foundation of China (62322210), the Beijing Municipal Natural Science Foundation for Distinguished Young Scholars (JQ21013), the Beijing Municipal Science and Technology Commission (Z231100005923031), and the 2023 Tencent AI Lab Rhino-Bird Focused Research Program.
Abstract: The emergence of 3D Gaussian splatting (3DGS) has greatly accelerated rendering in novel view synthesis. Unlike neural implicit representations such as neural radiance fields (NeRFs), which represent a 3D scene with position- and viewpoint-conditioned neural networks, 3D Gaussian splatting models the scene with a set of Gaussian ellipsoids, so that efficient rendering can be accomplished by rasterizing the ellipsoids into images. Apart from fast rendering, the explicit representation of 3D Gaussian splatting also facilitates downstream tasks such as dynamic reconstruction, geometry editing, and physical simulation. Considering the rapid changes and the growing number of works in this field, we present a literature review of recent 3D Gaussian splatting methods, which can be roughly classified by functionality into 3D reconstruction, 3D editing, and other downstream applications. Traditional point-based rendering methods and the rendering formulation of 3D Gaussian splatting are also covered to aid understanding of this technique. This survey aims to help beginners get started quickly in this field and to provide experienced researchers with a comprehensive overview, with the goal of stimulating future development of the 3D Gaussian splatting representation.
Funding: Supported by the National Natural Science Foundation of China (No. 62001432) and the Fundamental Research Funds for the Central Universities (No. CUC18LG024).
Abstract: Acquiring high-resolution light fields (LFs) is expensive. LF angular superresolution aims to synthesize the required number of views from a given sparse set of spatially high-resolution images. Existing methods struggle with sparsely sampled LFs captured with large baselines. Some methods rely on depth estimation and view reprojection and are sensitive to textureless and occluded regions; other, non-depth-based methods suffer from aliasing or blurring effects due to the large disparity. In addition, most methods require a specific model for each interpolation rate, which reduces their flexibility in practice. In this paper, we propose a learning framework that overcomes these challenges by exploiting the global and local structures of LFs. Our framework includes aggregation across both the angular and spatial dimensions to fully exploit the input data, and a novel bilateral upsampling module that upsamples each epipolar plane image while better preserving its local parallax structure. Furthermore, our method predicts the weights of the interpolation filters from both the subpixel offset and the range difference, allowing angular superresolution at different rates with a single model. We show that our non-depth-based method outperforms the state-of-the-art methods in handling large disparities and in flexibility, on both real-world and synthetic LF images.
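The idea of weighting interpolation filters by both subpixel offset and range difference can be illustrated with a 1-D bilateral interpolator along an EPI row: a spatial kernel from the fractional position is modulated by a range kernel from intensity differences, so an edge is not averaged across. Everything below (kernels, sigmas, data) is an illustrative assumption, not the paper's learned module.

```python
import numpy as np

def bilateral_interp(row, pos, sigma_s=1.0, sigma_r=0.1):
    """Interpolate `row` at fractional position `pos`: each nearby
    sample is weighted by a Gaussian of its subpixel offset (spatial
    term) times a Gaussian of its difference from the nearest
    sample's value (range term)."""
    i0 = int(np.floor(pos))
    idx = np.clip(np.arange(i0 - 1, i0 + 3), 0, len(row) - 1)
    ref = row[min(int(round(pos)), len(row) - 1)]   # nearest sample
    w_s = np.exp(-((idx - pos) ** 2) / (2 * sigma_s ** 2))
    w_r = np.exp(-((row[idx] - ref) ** 2) / (2 * sigma_r ** 2))
    w = w_s * w_r
    return float((w * row[idx]).sum() / w.sum())

epi_row = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # a sharp edge
print(bilateral_interp(epi_row, 2.3))  # stays near 0: edge preserved
```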