Funding: Partially supported by the National Natural Science Foundation of China under Grant No. 61203291 and the Specialised Research Fund for the Doctoral Program under Grant No. 20121101110035.
Abstract: The ability to recognise video events has attracted increasing attention owing to its extensive practical applications. Most events occur in a particular scene and involve particular people, so the scene context and the group context provide important information for event recognition. In this paper, we present an algorithm to recognise video events in different scenes in which there are multiple agents. First, we recognise events for each agent based on a Stochastic Context Sensitive Grammar (SCSG). Then we propose a scene model to infer the scene in which the events occur, and we use a co-occurrence matrix of events to represent the group context. Finally, the scene and group context are exploited to distinguish events having similar structures. Experimental results show that adding the scene and group context significantly improves event recognition performance.
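The group-context idea lends itself to a compact illustration. The sketch below is a minimal, hypothetical reconstruction, not the paper's implementation: it builds a row-normalised event co-occurrence matrix from per-clip agent events and uses it to rescore per-agent event posteriors. The event names, the `alpha` blending weight, and the linear rescoring rule are all illustrative assumptions.

```python
import numpy as np

# Hypothetical event vocabulary; the paper's actual event set is not given here.
EVENTS = ["walk", "queue", "buy_ticket", "sit", "talk"]
IDX = {e: i for i, e in enumerate(EVENTS)}

def cooccurrence_matrix(clips):
    """Count how often two events are observed in the same clip
    (one event per agent), then row-normalise the counts into probabilities."""
    C = np.zeros((len(EVENTS), len(EVENTS)))
    for agent_events in clips:          # one list of per-agent events per clip
        for a in agent_events:
            for b in agent_events:
                if a != b:
                    C[IDX[a], IDX[b]] += 1
    row_sums = C.sum(axis=1, keepdims=True)
    return np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)

def rescore(event_scores, context_events, M, alpha=0.5):
    """Blend one agent's event scores with group-context support:
    events that co-occur with the other agents' events gain weight.
    The convex blend with alpha is an assumed rule, not the paper's."""
    support = np.mean([M[IDX[e]] for e in context_events], axis=0)
    return (1 - alpha) * event_scores + alpha * support

clips = [["walk", "queue", "buy_ticket"], ["sit", "talk"], ["queue", "buy_ticket"]]
M = cooccurrence_matrix(clips)
scores = np.array([0.3, 0.4, 0.1, 0.1, 0.1])  # hypothetical SCSG posteriors
print(rescore(scores, ["queue"], M))          # "buy_ticket" gains support
```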
Abstract: Object detection is one of the most important tasks in remote sensing image interpretation. Most current deep-learning-based remote sensing object detectors rely on predefined anchor boxes and tend to ignore the contextual information in the scene, which limits their detection performance and generalisation ability. To address this, this paper proposes a Scene Related Anchor-Free YOLO network (SRAF-YOLO) for object detection in remote sensing images. SRAF-YOLO first introduces a scene-enhanced multi-scale feature extraction module: scene features are fused with object features to generate scene-enhanced features rich in scene context, and multi-scale operations then extract multi-scale features carrying scene semantics, effectively incorporating scene context. On this basis, a scene-assisted anchor-free detection head is designed that uses the scene information in the feature maps to constrain object category prediction, improving detection accuracy, while the anchor-free structure reduces the computation associated with anchor-box parameters. Experimental results on the RSOD and NWPU VHR-10 datasets show that, by fusing scene information with an anchor-free mechanism, SRAF-YOLO improves detection accuracy, reaching mean average precision (mAP) of 94.58% and 95.95%, respectively, gains of 1.51% and 3.0% over the YOLOv8 baseline, and outperforming the other compared methods. Validation on an external dataset further confirms that the algorithm generalises well.
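As a rough illustration of the scene-enhanced fusion described above, the following PyTorch sketch broadcasts a globally pooled scene descriptor over an object feature map and fuses the two with a 1x1 convolution. This is an assumption-laden stand-in, not SRAF-YOLO's actual module: the class name, the use of the same map for the scene branch, the channel count, and the SiLU activation are all hypothetical choices.

```python
import torch
import torch.nn as nn

class SceneEnhancedFusion(nn.Module):
    """Broadcast a global scene descriptor over the spatial feature map
    and fuse it with the object features via a 1x1 convolution.
    Simplification: the scene descriptor is pooled from the same map;
    the paper's module presumably uses a dedicated scene branch."""
    def __init__(self, channels):
        super().__init__()
        self.scene_pool = nn.AdaptiveAvgPool2d(1)            # global scene context
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, feat):
        scene = self.scene_pool(feat)                        # (B, C, 1, 1)
        scene = scene.expand_as(feat)                        # broadcast spatially
        return self.act(self.fuse(torch.cat([feat, scene], dim=1)))

x = torch.randn(2, 256, 40, 40)              # a hypothetical backbone feature map
print(SceneEnhancedFusion(256)(x).shape)     # torch.Size([2, 256, 40, 40])
```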
Funding: Funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R410), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Human object detection and recognition is essential for elderly monitoring and assisted living; however, models relying solely on pose or scene context often struggle in cluttered or visually ambiguous settings. To address this, we present SCENET-3D, a transformer-driven multimodal framework that unifies human-centric skeleton features with scene-object semantics for intelligent robotic vision through a three-stage pipeline. In the first stage, scene analysis, rich geometric and texture descriptors are extracted from RGB frames, including surface-normal histograms, angles between neighboring normals, Zernike moments, directional standard deviation, and Gabor-filter responses. In the second stage, scene-object analysis, non-human objects are segmented and represented using local feature descriptors and complementary surface-normal information. In the third stage, human-pose estimation, silhouettes are processed through an enhanced MoveNet to obtain 2D anatomical keypoints, which are fused with depth information and converted into RGB-based point clouds to construct pseudo-3D skeletons. Features from all three stages are fused and fed into a transformer encoder with multi-head attention to resolve visually similar activities. Experiments on UCLA (95.8%), ETRI-Activity3D (89.4%), and CAD-120 (91.2%) demonstrate that combining pseudo-3D skeletons with rich scene-object fusion significantly improves generalizable activity recognition, enabling safer elderly care, natural human–robot interaction, and robust context-aware robotic perception in real-world environments.
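To make the final fusion stage concrete, here is a minimal sketch of how features from the three stages might be projected into a shared token space and fused by a transformer encoder with multi-head attention. The feature dimensions, layer and head counts, mean pooling, and class count are illustrative guesses, not SCENET-3D's reported configuration.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Project scene, scene-object, and skeleton features into a common
    token space and fuse them with a transformer encoder.
    All hyperparameters below are assumed for illustration."""
    def __init__(self, dims=(512, 256, 128), d_model=256, num_classes=10):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(d_model, num_classes)

    def forward(self, scene, objects, skeleton):
        # one token per modality: (B, 3, d_model)
        tokens = torch.stack([p(x) for p, x in
                              zip(self.proj, (scene, objects, skeleton))], dim=1)
        fused = self.encoder(tokens).mean(dim=1)   # pool across modalities
        return self.cls(fused)

model = MultimodalFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 128))
print(logits.shape)   # torch.Size([4, 10])
```

Treating each modality as a single token keeps the sketch compact; a per-frame or per-joint tokenisation would be a natural extension for temporal activity data.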