Funding: Supported by the National Natural Science Foundation of China (Grant No. 51935003).
Abstract: In recent years, addressing ill-posed problems by leveraging prior knowledge from databases via learning techniques has gained much attention. In this paper, we focus on complete three-dimensional (3D) point cloud reconstruction from a single red-green-blue (RGB) image, a task that cannot be approached using classical reconstruction techniques. For this purpose, we use an encoder-decoder framework to encode the RGB information in a latent space and to predict the 3D structure of the considered object from different viewpoints. The individual predictions are combined into a common representation that is used in a module combining camera pose estimation and rendering, thereby achieving differentiability with respect to the imaging process and the camera pose, and enabling optimization of the two-dimensional prediction error for novel viewpoints. Thus, our method allows end-to-end training and does not require supervision from additional ground-truth (GT) mask annotations or ground-truth camera pose annotations. Our evaluation on synthetic and real-world data demonstrates the robustness of our approach to appearance changes and self-occlusions: it outperforms current state-of-the-art methods in terms of accuracy, density, and model completeness.
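As a rough illustration of the encode-then-decode-per-viewpoint idea described in this abstract, the sketch below maps an RGB image to a latent code and predicts a small point cloud for several viewpoints, then merges them into a common representation. All shapes, weights, and the `np.roll` stand-in for viewpoint-conditioned decoding are hypothetical; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights standing in for a trained network (hypothetical sizes).
H, W, C = 32, 32, 3      # input RGB resolution
LATENT = 16              # latent-space dimensionality
N_PTS = 256              # points predicted per viewpoint

W_enc = rng.normal(scale=0.01, size=(H * W * C, LATENT))
W_dec = rng.normal(scale=0.01, size=(LATENT, N_PTS * 3))

def encode(rgb):
    """Flatten the RGB image and project it into the latent space."""
    return np.tanh(rgb.reshape(-1) @ W_enc)

def decode(latent):
    """Predict an (N_PTS, 3) point cloud from a latent code."""
    return (latent @ W_dec).reshape(N_PTS, 3)

def predict_views(rgb, n_views=4):
    """Predict the object from several viewpoints and merge the clouds.

    np.roll is only a toy stand-in for conditioning the decoder on the
    viewpoint; the real method predicts per-view structure differently.
    """
    latent = encode(rgb)
    clouds = [decode(np.roll(latent, k)) for k in range(n_views)]
    return np.concatenate(clouds, axis=0)  # common representation

image = rng.random((H, W, C))
cloud = predict_views(image)
print(cloud.shape)  # (1024, 3)
```

In the paper's pipeline, this merged cloud would then feed a differentiable pose-estimation and rendering module so the 2D reprojection error can be optimized end to end.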
Funding: Funded by the Princess Nourah bint Abdulrahman University Researchers Supporting Project (No. PNURSP2025R410), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Human object detection and recognition is essential for elderly monitoring and assisted living; however, models relying solely on pose or scene context often struggle in cluttered or visually ambiguous settings. To address this, we present SCENET-3D, a transformer-driven multimodal framework that unifies human-centric skeleton features with scene-object semantics for intelligent robotic vision through a three-stage pipeline. In the first stage, scene analysis, rich geometric and texture descriptors are extracted from RGB frames, including surface-normal histograms, angles between neighboring normals, Zernike moments, directional standard deviation, and Gabor-filter responses. In the second stage, scene-object analysis, non-human objects are segmented and represented using local feature descriptors and complementary surface-normal information. In the third stage, human-pose estimation, silhouettes are processed through an enhanced MoveNet to obtain 2D anatomical keypoints, which are fused with depth information and converted into RGB-based point clouds to construct pseudo-3D skeletons. Features from all three stages are fused and fed into a transformer encoder with multi-head attention to resolve visually similar activities. Experiments on UCLA (95.8%), ETRI-Activity3D (89.4%), and CAD-120 (91.2%) demonstrate that combining pseudo-3D skeletons with rich scene-object fusion significantly improves generalizable activity recognition, enabling safer elderly care, natural human-robot interaction, and robust context-aware robotic perception in real-world environments.
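The final fusion step above feeds one feature token per stage into a multi-head self-attention layer. A minimal numpy sketch of that mechanism follows; the dimensions, random weights, and three-token setup are assumptions for illustration, not SCENET-3D's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

D = 32        # model dimension (hypothetical)
HEADS = 4     # attention heads
DH = D // HEADS

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(tokens):
    """Scaled dot-product self-attention over the stage feature tokens."""
    n = tokens.shape[0]
    Wq, Wk, Wv = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    out = np.zeros((n, D))
    for h in range(HEADS):
        sl = slice(h * DH, (h + 1) * DH)           # this head's channels
        attn = softmax(q[:, sl] @ k[:, sl].T / np.sqrt(DH))
        out[:, sl] = attn @ v[:, sl]
    return out

# One token per pipeline stage: scene, scene-object, pseudo-3D skeleton.
scene, objects, skeleton = (rng.normal(size=(D,)) for _ in range(3))
fused = multi_head_attention(np.stack([scene, objects, skeleton]))
print(fused.shape)  # (3, 32)
```

Letting each stage's token attend to the others is what allows, e.g., scene-object evidence to disambiguate two activities whose skeleton features look alike.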
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 52472347 and 52331012) and the Open Fund of the Chongqing Key Laboratory of Green Logistics Intelligent Technology (Chongqing Jiaotong University) (Grant No. KLGLIT2024ZD001).
Abstract: This study proposes a framework for extracting automatic guided vehicle (AGV) kinematic information from port-like videos, which provides a solution for situation awareness based on port surveillance videos. Firstly, vehicle pixel-wise positions in port-like videos are determined by the visual object tracking (SeqTrack) model. Secondly, the extrinsic parameters of the query images are estimated by the generalizable model-free 6-DoF object (Gen6D) pose estimation method. More specifically, a point cloud of the AGV is reconstructed from multi-view AGV reference images, and the image extrinsic parameters are obtained through structure from motion. The reference image whose viewpoint is most similar to that of the query image is identified with the Gen6D selection module; as a result, the extrinsic parameters of the reference image can be used to estimate those of the query image. After that, the extrinsic parameters of the query image are refined with the support of the Gen6D refinement module. Thirdly, we obtain vehicle displacement by mapping the vehicle point cloud coordinates into the camera coordinate system, and then estimate the vehicle movement information with the help of a generative adversarial network model. Experimental results suggest that the pose estimation metrics of our method, average discrepancy distance and 2D re-projection error, reach 0.76 and 0.75, respectively. The mean absolute error and root mean squared error of the estimated vehicle displacement reach 0.023 and 0.030 for scene #1 (i.e., the AGV moves along the x-axis of the camera coordinate system) and 0.182 and 0.298 for scene #2 (i.e., the camera follows the AGV moving along the x-axis of the camera coordinate system).
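The third step, mapping the vehicle point cloud into the camera coordinate system via the estimated extrinsics and reading off displacement, can be sketched as below. The cube point cloud, the identity rotation, and the two translations are illustrative assumptions, not values from the paper.

```python
import numpy as np

def to_camera(points, R, t):
    """Map object-frame points into the camera frame: X_cam = R @ X + t."""
    return points @ R.T + t

def displacement(c0, c1):
    """Displacement between two frames, from the cloud centroids."""
    return np.linalg.norm(c1 - c0)

# Hypothetical AGV point cloud (unit-cube corners) and two estimated poses.
cloud = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                 dtype=float)
R = np.eye(3)                          # no rotation between frames in this sketch
t_frame0 = np.array([0.0, 0.0, 5.0])
t_frame1 = np.array([0.5, 0.0, 5.0])   # AGV moved 0.5 along the camera x-axis

cam0 = to_camera(cloud, R, t_frame0)
cam1 = to_camera(cloud, R, t_frame1)
print(displacement(cam0.mean(axis=0), cam1.mean(axis=0)))  # 0.5
```

In the full framework these per-frame displacements, estimated from Gen6D extrinsics rather than known translations, are what the generative adversarial network then refines into movement information.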
Funding: Supported by the National Natural Science Foundation of China (Nos. 60808020 and 61078041), the Natural Science Foundation of Tianjin City (Nos. 15JCYBJC51700 and 16JCYBJC15400), and the National Science and Technology Support Program (No. 2014BAH03F01).
Abstract: To obtain complete data in optical measurement, a multi-view three-dimensional (3D) measurement method based on a turntable is proposed. In this method, a turntable rotates the object to capture multi-view point cloud data, and the multi-view point clouds are then registered and integrated into a 3D model. The measurement results are compared with those of the sticking-marked-point method. Experimental results show that the measurement process of the proposed method is simpler, and that scanning speed and accuracy are improved.
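Because the turntable angle for each view is known, registration reduces to undoing each view's rotation about the turntable axis before merging, which the following sketch illustrates. The z-up axis, the four angles, and the random object are assumptions for illustration only.

```python
import numpy as np

def rot_z(theta):
    """Rotation about the turntable (z) axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def register_views(views, angles):
    """Undo each view's turntable rotation and merge into one model."""
    parts = [pts @ rot_z(-a).T for pts, a in zip(views, angles)]
    return np.concatenate(parts, axis=0)

# Simulate scanning: at turntable angle a, the scanner sees rot_z(a) @ X.
rng = np.random.default_rng(2)
model = rng.random((50, 3))
angles = np.deg2rad([0, 90, 180, 270])
views = [model @ rot_z(a).T for a in angles]

merged = register_views(views, angles)
print(merged.shape)  # (200, 3)
```

In practice the turntable axis and center must first be calibrated relative to the scanner; this is what replaces the sticky markers of the marked-point method.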
Funding: Supported by the National Natural Science Foundation of China (Nos. 62122071 and 62272433), the Fundamental Research Funds for the Central Universities (No. WK3470000021), and the Alibaba Innovation Research Program (AIR).
Abstract: Recently, neural implicit function-based representation has attracted more and more attention and has been widely used to represent surfaces using differentiable neural networks. However, surface reconstruction from point clouds or multi-view images using existing neural geometry representations still suffers from slow computation and poor accuracy. To alleviate these issues, we propose a multi-scale hash encoding-based neural geometry representation which effectively and efficiently represents the surface as a signed distance field. Our novel neural network structure carefully combines low-frequency Fourier position encoding with multi-scale hash encoding. The initialization of the geometry network and the geometry features of the rendering module are redesigned accordingly. Our experiments demonstrate that the proposed representation is at least 10 times faster for reconstructing point clouds with millions of points. It also significantly improves the speed and accuracy of multi-view reconstruction. Our code and models are available at https://github.com/Dengzhi-USTC/Neural-Geometry-Reconstruction.
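The core combination described above, low-frequency Fourier position encoding concatenated with a multi-scale hash-grid lookup, can be sketched as follows. Table sizes, base resolution, hash primes, and the nearest-vertex lookup (no trilinear interpolation) are simplifying assumptions; consult the authors' repository for the actual representation.

```python
import numpy as np

TABLE_SIZE = 2 ** 10   # entries per hash table (real systems use far more)
FEAT = 2               # learned features per table entry
LEVELS = 4             # hash-grid resolutions
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

rng = np.random.default_rng(3)
tables = rng.normal(scale=1e-4, size=(LEVELS, TABLE_SIZE, FEAT))

def fourier_encode(x, n_freq=4):
    """Low-frequency Fourier position encoding of a 3D point."""
    freqs = 2.0 ** np.arange(n_freq)
    angles = np.outer(freqs, x).ravel()          # n_freq * 3 angles
    return np.concatenate([np.sin(angles), np.cos(angles)])

def hash_encode(x):
    """Nearest-vertex multi-scale hash lookup (interpolation omitted)."""
    feats = []
    for lvl in range(LEVELS):
        res = 16 * 2 ** lvl                      # doubling grid resolution
        idx = np.floor(x * res).astype(np.uint64)
        h = int(np.bitwise_xor.reduce(idx * PRIMES) % TABLE_SIZE)
        feats.append(tables[lvl, h])
    return np.concatenate(feats)

def encode_point(x):
    """Combined input representation for the signed-distance network."""
    return np.concatenate([fourier_encode(x), hash_encode(x)])

x = np.array([0.3, 0.5, 0.7])
print(encode_point(x).shape)  # (32,)
```

The Fourier part gives a smooth global signal while the trainable hash features add fine local detail, which is the intuition behind combining the two encodings.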