Funding: Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R410), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Human object detection and recognition are essential for elderly monitoring and assisted living; however, models relying solely on pose or scene context often struggle in cluttered or visually ambiguous settings. To address this, we present SCENET-3D, a transformer-driven multimodal framework that unifies human-centric skeleton features with scene-object semantics for intelligent robotic vision through a three-stage pipeline. In the first stage, scene analysis, rich geometric and texture descriptors are extracted from RGB frames, including surface-normal histograms, angles between neighboring normals, Zernike moments, directional standard deviation, and Gabor-filter responses. In the second stage, scene-object analysis, non-human objects are segmented and represented using local feature descriptors and complementary surface-normal information. In the third stage, human-pose estimation, silhouettes are processed through an enhanced MoveNet to obtain 2D anatomical keypoints, which are fused with depth information and converted into RGB-based point clouds to construct pseudo-3D skeletons. Features from all three stages are fused and fed into a transformer encoder with multi-head attention to resolve visually similar activities. Experiments on UCLA (95.8%), ETRI-Activity3D (89.4%), and CAD-120 (91.2%) demonstrate that combining pseudo-3D skeletons with rich scene-object fusion significantly improves generalizable activity recognition, enabling safer elderly care, natural human–robot interaction, and robust context-aware robotic perception in real-world environments.
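To make the last two steps of the pipeline concrete, the sketch below illustrates (in PyTorch, as an assumption rather than the authors' released code) how 2D keypoints can be back-projected to pseudo-3D joints using an aligned depth map, and how tokens from the three stages can be fused in a transformer encoder with multi-head attention. All tensor shapes, camera intrinsics, and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

def lift_keypoints_to_3d(kps_2d, depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0):
    """Back-project 2D keypoints (J, 2) in pixel coordinates to pseudo-3D
    joints (J, 3) using an aligned depth map (H, W) and assumed pinhole
    intrinsics (the default values here are placeholders)."""
    u, v = kps_2d[:, 0], kps_2d[:, 1]
    h, w = depth.shape
    z = depth[v.long().clamp(0, h - 1), u.long().clamp(0, w - 1)]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return torch.stack([x, y, z], dim=-1)

class FusionEncoder(nn.Module):
    """Concatenate scene, scene-object, and skeleton tokens and let
    multi-head self-attention relate them before classification."""
    def __init__(self, d_model=256, n_heads=8, n_layers=4, n_classes=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, scene_tok, object_tok, skel_tok):
        # each input: (batch, n_tokens_i, d_model)
        tokens = torch.cat([scene_tok, object_tok, skel_tok], dim=1)
        fused = self.encoder(tokens).mean(dim=1)  # pool over all tokens
        return self.head(fused)                   # activity logits

# toy usage with random data
skel = lift_keypoints_to_3d(torch.rand(17, 2) * 400, torch.rand(480, 640) * 3.0)
model = FusionEncoder()
scores = model(torch.rand(1, 4, 256), torch.rand(1, 6, 256), torch.rand(1, 17, 256))
```

In this reading, depth turns each MoveNet keypoint into a metric 3D joint, and self-attention lets skeleton tokens attend to scene and object tokens when two activities share similar poses.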
Funding: Supported by the National Natural Science Foundation of China (Project No. 61902210), a Research Grant of Beijing Higher Institution Engineering Research Center, and the Tsinghua–Tencent Joint Laboratory for Internet Innovation Technology.
Abstract: Human–object interaction (HOI) detection is crucial for human-centric image understanding, which aims to infer ⟨human, action, object⟩ triplets within an image. Recent studies often exploit visual features and the spatial configuration of a human–object pair in order to learn the action linking the human and the object in the pair. We argue that such a paradigm of pairwise feature extraction and action inference can be applied not only at the whole human and object instance level, but also at the part level, at which a body part interacts with an object, and at the semantic level, by considering the semantic label of an object along with human appearance and human–object spatial configuration, to infer the action. We thus propose a multi-level pairwise feature network (PFNet) for detecting human–object interactions. The network consists of three parallel streams that characterize HOIs using pairwise features at the above three levels; the three streams are finally fused to give the action prediction. Extensive experiments show that our proposed PFNet outperforms other state-of-the-art methods on the V-COCO dataset and achieves results comparable to the state of the art on the HICO-DET dataset.
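As a rough illustration of the three-stream design (a sketch under assumed feature dimensions, not the published PFNet implementation), each stream can be read as an MLP over its own pairwise feature vector, with the streams fused by late addition of per-action logits:

```python
import torch
import torch.nn as nn

class Stream(nn.Module):
    """One pairwise-feature stream: an MLP scoring actions from a
    concatenated pairwise feature vector."""
    def __init__(self, in_dim, n_actions, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.mlp(x)

class PFNetSketch(nn.Module):
    """Hypothetical dimensions: 2048-d appearance features, 64-d spatial
    encodings, 1024-d part crops, 300-d object-label word vectors."""
    def __init__(self, n_actions=26):
        super().__init__()
        self.instance = Stream(2048 + 64, n_actions)         # human/object appearance + spatial
        self.part     = Stream(1024 + 64, n_actions)         # body-part/object pair + spatial
        self.semantic = Stream(300 + 2048 + 64, n_actions)   # object label + human + spatial

    def forward(self, inst_feat, part_feat, sem_feat):
        # late fusion: sum the per-stream action logits
        logits = self.instance(inst_feat) + self.part(part_feat) + self.semantic(sem_feat)
        return torch.sigmoid(logits)  # multi-label action scores

model = PFNetSketch()
scores = model(torch.rand(1, 2112), torch.rand(1, 1088), torch.rand(1, 2412))
```

Additive late fusion is one simple way to combine the streams; the number of actions and the exact fusion rule here are placeholders.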