Funding: Partially supported by the National Natural Science Foundation of China under Grant No. 61203291 and the Specialised Research Fund for the Doctoral Program under Grant No. 20121101110035.
Abstract: Video event recognition has attracted growing interest owing to its extensive practical applications. Most events occur in a particular scene with particular people, so the scene context and group context provide important information for event recognition. In this paper, we present an algorithm to recognise video events in different scenes that contain multiple agents. First, we recognise events for each agent based on a Stochastic Context Sensitive Grammar (SCSG). Then we propose a scene model to infer the scene in which the events occur, and we use a co-occurrence matrix of events to represent the group context. Finally, the scene and group context are exploited to distinguish events that have similar structures. Experimental results show that adding the scene and group context can significantly improve event recognition performance.
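The abstract above does not say how the event co-occurrence matrix is built, so the following is only a minimal sketch of one plausible reading: count how often two event labels are recognised in the same clip, then normalise each row. The event labels, the clip format, and the function names `cooccurrence_matrix` and `normalise_rows` are all hypothetical.

```python
from itertools import combinations


def cooccurrence_matrix(clips, labels):
    """Count how often two event labels are observed in the same clip.

    clips  : list of lists; each inner list holds the event labels
             recognised for the agents in one clip (assumed format).
    labels : ordered list of all event labels.
    """
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    counts = [[0] * n for _ in range(n)]
    for events in clips:
        # Count each unordered pair of distinct labels once per clip.
        for a, b in combinations(sorted(set(events)), 2):
            counts[index[a]][index[b]] += 1
            counts[index[b]][index[a]] += 1
    return counts


def normalise_rows(counts):
    """Turn counts into row-wise relative frequencies (rows sum to 1 or 0)."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([c / total if total else 0.0 for c in row])
    return result


# Hypothetical labels and clips, for illustration only.
labels = ["talk", "hand_over", "fight"]
clips = [["talk", "hand_over"], ["talk", "talk", "hand_over"], ["fight", "talk"]]
counts = cooccurrence_matrix(clips, labels)
probs = normalise_rows(counts)
```

In this reading, a row of `probs` could act as a prior over what other agents are likely doing given one agent's event, which is one way group context might help separate events with similar structure.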
Funding: Funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R410), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Human object detection and recognition is essential for elderly monitoring and assisted living; however, models relying solely on pose or scene context often struggle in cluttered or visually ambiguous settings. To address this, we present SCENET-3D, a transformer-driven multimodal framework that unifies human-centric skeleton features with scene-object semantics for intelligent robotic vision through a three-stage pipeline. In the first stage, scene analysis, rich geometric and texture descriptors are extracted from RGB frames, including surface-normal histograms, angles between neighboring normals, Zernike moments, directional standard deviation, and Gabor-filter responses. In the second stage, scene-object analysis, non-human objects are segmented and represented using local feature descriptors and complementary surface-normal information. In the third stage, human-pose estimation, silhouettes are processed through an enhanced MoveNet to obtain 2D anatomical keypoints, which are fused with depth information and converted into RGB-based point clouds to construct pseudo-3D skeletons. Features from all three stages are fused and fed into a transformer encoder with multi-head attention to resolve visually similar activities. Experiments on UCLA (95.8%), ETRI-Activity3D (89.4%), and CAD-120 (91.2%) demonstrate that combining pseudo-3D skeletons with rich scene-object fusion significantly improves generalizable activity recognition, enabling safer elderly care, natural human–robot interaction, and robust context-aware robotic perception in real-world environments.
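The abstract above does not detail how the 2D keypoints are fused with depth. As a hedged illustration only, the sketch below assumes standard pinhole back-projection: each keypoint (u, v) is lifted to camera coordinates using the depth value at that pixel. The function name `backproject_keypoints`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth map are assumptions and are not taken from SCENET-3D.

```python
def backproject_keypoints(keypoints_2d, depth_map, fx, fy, cx, cy):
    """Lift 2D keypoints (u, v), in pixels, to 3D camera coordinates.

    Assumes a pinhole camera with focal lengths fx, fy and principal
    point (cx, cy); depth_map[v][u] holds the depth at pixel (u, v).
    """
    skeleton = []
    for u, v in keypoints_2d:
        # Look up depth at the nearest pixel to the keypoint.
        z = depth_map[int(round(v))][int(round(u))]
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        skeleton.append((x, y, z))
    return skeleton


# Toy 4x4 depth map with constant depth 2.0 (illustrative values only).
depth_map = [[2.0] * 4 for _ in range(4)]
keypoints_2d = [(2, 2), (3, 1)]
skeleton = backproject_keypoints(keypoints_2d, depth_map, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

A real pipeline would also need to handle missing or noisy depth values and keypoints that fall outside the image, which this sketch omits.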