Deep learning has made significant progress in the field of oriented object detection for remote sensing images.However,existing methods still face challenges when dealing with difficult tasks such as multi-scale targ...Deep learning has made significant progress in the field of oriented object detection for remote sensing images.However,existing methods still face challenges when dealing with difficult tasks such as multi-scale targets,complex backgrounds,and small objects in remote sensing.Maintaining model lightweight to address resource constraints in remote sensing scenarios while improving task completion for remote sensing tasks remains a research hotspot.Therefore,we propose an enhanced multi-scale feature extraction lightweight network EM-YOLO based on the YOLOv8s architecture,specifically optimized for the characteristics of large target scale variations,diverse orientations,and numerous small objects in remote sensing images.Our innovations lie in two main aspects:First,a dynamic snake convolution(DSC)is introduced into the backbone network to enhance the model’s feature extraction capability for oriented targets.Second,an innovative focusing-diffusion module is designed in the feature fusion neck to effectively integrate multi-scale feature information.Finally,we introduce Layer-Adaptive Sparsity for magnitude-based Pruning(LASP)method to perform lightweight network pruning to better complete tasks in resource-constrained scenarios.Experimental results on the lightweight platform Orin demonstrate that the proposed method significantly outperforms the original YOLOv8s model in oriented remote sensing object detection tasks,and achieves comparable or superior performance to state-of-the-art methods on three authoritative remote sensing datasets(DOTA v1.0,DOTA v1.5,and HRSC2016).展开更多
The ubiquity of mobile devices has driven advancements in mobile object detection.However,challenges in multi-scale object detection in open,complex environments persist due to limited computational resources.Traditio...The ubiquity of mobile devices has driven advancements in mobile object detection.However,challenges in multi-scale object detection in open,complex environments persist due to limited computational resources.Traditional approaches like network compression,quantization,and lightweight design often sacrifice accuracy or feature representation robustness.This article introduces the Fast Multi-scale Channel Shuffling Network(FMCSNet),a novel lightweight detection model optimized for mobile devices.FMCSNet integrates a fully convolutional Multilayer Perceptron(MLP)module,offering global perception without significantly increasing parameters,effectively bridging the gap between CNNs and Vision Transformers.FMCSNet achieves a delicate balance between computation and accuracy mainly by two key modules:the ShiftMLP module,including a shift operation and an MLP module,and a Partial group Convolutional(PGConv)module,reducing computation while enhancing information exchange between channels.With a computational complexity of 1.4G FLOPs and 1.3M parameters,FMCSNet outperforms CNN-based and DWConv-based ShuffleNetv2 by 1%and 4.5%mAP on the Pascal VOC 2007 dataset,respectively.Additionally,FMCSNet achieves a mAP of 30.0(0.5:0.95 IoU threshold)with only 2.5G FLOPs and 2.0M parameters.It achieves 32 FPS on low-performance i5-series CPUs,meeting real-time detection requirements.The versatility of the PGConv module’s adaptability across scenarios further highlights FMCSNet as a promising solution for real-time mobile object detection.展开更多
In recent years,with the rapid advancement of artificial intelligence,object detection algorithms have made significant strides in accuracy and computational efficiency.Notably,research and applications of Anchor-Free...In recent years,with the rapid advancement of artificial intelligence,object detection algorithms have made significant strides in accuracy and computational efficiency.Notably,research and applications of Anchor-Free models have opened new avenues for real-time target detection in optical remote sensing images(ORSIs).However,in the realmof adversarial attacks,developing adversarial techniques tailored to Anchor-Freemodels remains challenging.Adversarial examples generated based on Anchor-Based models often exhibit poor transferability to these new model architectures.Furthermore,the growing diversity of Anchor-Free models poses additional hurdles to achieving robust transferability of adversarial attacks.This study presents an improved cross-conv-block feature fusion You Only Look Once(YOLO)architecture,meticulously engineered to facilitate the extraction ofmore comprehensive semantic features during the backpropagation process.To address the asymmetry between densely distributed objects in ORSIs and the corresponding detector outputs,a novel dense bounding box attack strategy is proposed.This approach leverages dense target bounding boxes loss in the calculation of adversarial loss functions.Furthermore,by integrating translation-invariant(TI)and momentum-iteration(MI)adversarial methodologies,the proposed framework significantly improves the transferability of adversarial attacks.Experimental results demonstrate that our method achieves superior adversarial attack performance,with adversarial transferability rates(ATR)of 67.53%on the NWPU VHR-10 dataset and 90.71%on the HRSC2016 dataset.Compared to ensemble adversarial attack and cascaded adversarial attack approaches,our method generates adversarial examples in an average of 0.64 s,representing an approximately 14.5%improvement in efficiency under equivalent conditions.展开更多
The goal of the present work is to demonstrate the potential of Artificial Neural Network(ANN)-driven Genetic Algorithm(GA)methods for energy efficiency and economic performance optimization of energy efficiency measu...The goal of the present work is to demonstrate the potential of Artificial Neural Network(ANN)-driven Genetic Algorithm(GA)methods for energy efficiency and economic performance optimization of energy efficiency measures in a multi-family house building in Greece.The energy efficiency measures include different heating/cooling systems(such as low-temperature and high-temperature heat pumps,natural gas boilers,split units),building envelope components for floor,walls,roof and windows of variable heat transfer coefficients,the installation of solar thermal collectors and PVs.The calculations of the building loads and investment and operating and maintenance costs of the measures are based on the methodology defined in Directive 2010/31/EU,while economic assumptions are based on EN 15459-1 standard.Typically,multi-objective optimization of energy efficiency measures often requires the simulation of very large numbers of cases involving numerous possible combinations,resulting in intense computational load.The results of the study indicate that ANN-driven GA methods can be used as an alternative,valuable tool for reliably predicting the optimal measures which minimize primary energy consumption and life cycle cost of the building with greatly reduced computational requirements.Through GA methods,the computational time needed for obtaining the optimal solutions is reduced by 96.4%-96.8%.展开更多
棉花生长过程中受到害虫严重危害,因此精准的害虫检测已成为智慧农业体系中的关键环节。其中大量棉田害虫属于小目标,特征提取困难,而且害虫个体之间存在显著的尺寸差异,这限制了现有目标检测算法的性能。提出了一种结合离散小波变换与T...棉花生长过程中受到害虫严重危害,因此精准的害虫检测已成为智慧农业体系中的关键环节。其中大量棉田害虫属于小目标,特征提取困难,而且害虫个体之间存在显著的尺寸差异,这限制了现有目标检测算法的性能。提出了一种结合离散小波变换与Transformer的YOLO11目标检测算法——WTNet-YOLO(wavelet and Transformer network-YOLO)。融合部分卷积与多尺度深度卷积构建C3K2-MKPF模块,增强对多尺寸目标的特征提取能力。在颈部结合小波域融合模块(wavelet domain fusion module,WDFM)和跨阶段部分局部和全局模块(cross stage partial local and global block,CSP-LGB),提升各尺寸害虫的频域信息表达与全局信息定位。引入多尺度自适应空间注意门(multi-scale adaptive spatial attention gate,MASAG),动态融合主干与颈部的跨层特征,强化空间与语义信息表达。为验证相关方法,构建了一个棉田害虫数据集YST-PestCotton(yellow sticky trap pest dataset in cotton),涵盖多个尺寸范围的害虫,具有显著的尺度多样性,害虫像素面积最大可相差1200多倍。实验表明,在YST-PestCotton上mAP50提升了3.1个百分点,同时将害虫按目标框面积划分为0~256、256~512、512~1024和大于1024四个子集,mAP50分别提升2.4、1.3、1.5、3个百分点。在公开数据集Yellow sticky traps上mAP50达到了最高的95.3%。综合来看,WTNet-YOLO能够有效应对小目标内部的尺寸差异,同时兼顾不同尺寸害虫的检测需求。展开更多
Drone-based small object detection is of great significance in practical applications such as military actions, disaster rescue, transportation, etc. However, the severe scale differences in objects captured by drones...Drone-based small object detection is of great significance in practical applications such as military actions, disaster rescue, transportation, etc. However, the severe scale differences in objects captured by drones and lack of detail information for small-scale objects make drone-based small object detection a formidable challenge. To address these issues, we first develop a mathematical model to explore how changing receptive fields impacts the polynomial fitting results. Subsequently, based on the obtained conclusions, we propose a simple but effective Hybrid Receptive Field Network (HRFNet), whose modules include Hybrid Feature Augmentation (HFA), Hybrid Feature Pyramid (HFP) and Dual Scale Head (DSH). Specifically, HFA employs parallel dilated convolution kernels of different sizes to extend shallow features with different receptive fields, committed to improving the multi-scale adaptability of the network;HFP enhances the perception of small objects by capturing contextual information across layers, while DSH reconstructs the original prediction head utilizing a set of high-resolution features and ultrahigh-resolution features. In addition, in order to train HRFNet, the corresponding dual-scale loss function is designed. Finally, comprehensive evaluation results on public benchmarks such as VisDrone-DET and TinyPerson demonstrate the robustness of the proposed method. Most impressively, the proposed HRFNet achieves a mAP of 51.0 on VisDrone-DET with 29.3 M parameters, which outperforms the extant state-of-the-art detectors. HRFNet also performs excellently in complex scenarios captured by drones, achieving the best performance on the CS-Drone dataset we built.展开更多
Visible and infrared(RGB-IR)fusion object detection plays an important role in security,disaster relief,etc.In recent years,deep-learning-based RGB-IR fusion detection methods have been developing rapidly,but still st...Visible and infrared(RGB-IR)fusion object detection plays an important role in security,disaster relief,etc.In recent years,deep-learning-based RGB-IR fusion detection methods have been developing rapidly,but still struggle to deal with the complex and changing scenarios captured by drones,mainly due to two reasons:(A)RGB-IR fusion detectors are susceptible to inferior inputs that degrade performance and stability.(B)RGB-IR fusion detectors are susceptible to redundant features that reduce accuracy and efficiency.In this paper,an innovative RGB-IR fusion detection framework based on global-local feature optimization,named GLFDet,is proposed to improve the detection performance and efficiency of drone-captured objects.The key components of GLFDet include a Global Feature Optimization(GFO)module,a Local Feature Optimization(LFO)module and a Channel Separation Fusion(CSF)module.Specifically,GFO calculates the information content of the input image from the frequency domain and optimizes the features holistically.Then,LFO dynamically selects high-value features and filters out low-value features before fusion,which significantly improves the efficiency of fusion.Finally,CSF fuses the RGB and IR features across the corresponding channels,which avoids the rearrangement of the channel relationships and enhances the model stability.Extensive experimental results show that the proposed method achieves the best performance on three popular RGB-IR datasets Drone Vehicle,VEDAI,and LLVIP.In addition,GLFDet is more lightweight than other comparable models,making it more appealing to edge devices such as drones.The code is available at https://github.com/lao chen330/GLFDet.展开更多
Aiming at the problems of low detection accuracy and large model size of existing object detection algorithms applied to complex road scenes,an improved you only look once version 8(YOLOv8)object detection algorithm f...Aiming at the problems of low detection accuracy and large model size of existing object detection algorithms applied to complex road scenes,an improved you only look once version 8(YOLOv8)object detection algorithm for infrared images,F-YOLOv8,is proposed.First,a spatial-to-depth network replaces the traditional backbone network's strided convolution or pooling layer.At the same time,it combines with the channel attention mechanism so that the neural network focuses on the channels with large weight values to better extract low-resolution image feature information;then an improved feature pyramid network of lightweight bidirectional feature pyramid network(L-BiFPN)is proposed,which can efficiently fuse features of different scales.In addition,a loss function of insertion of union based on the minimum point distance(MPDIoU)is introduced for bounding box regression,which obtains faster convergence speed and more accurate regression results.Experimental results on the FLIR dataset show that the improved algorithm can accurately detect infrared road targets in real time with 3%and 2.2%enhancement in mean average precision at 50%IoU(mAP50)and mean average precision at 50%—95%IoU(mAP50-95),respectively,and 38.1%,37.3%and 16.9%reduction in the number of model parameters,the model weight,and floating-point operations per second(FLOPs),respectively.To further demonstrate the detection capability of the improved algorithm,it is tested on the public dataset PASCAL VOC,and the results show that F-YOLO has excellent generalized detection performance.展开更多
Object detection plays a critical role in drone imagery analysis,especially in remote sensing applications where accurate and efficient detection of small objects is essential.Despite significant advancements in drone...Object detection plays a critical role in drone imagery analysis,especially in remote sensing applications where accurate and efficient detection of small objects is essential.Despite significant advancements in drone imagery detection,most models still struggle with small object detection due to challenges such as object size,complex backgrounds.To address these issues,we propose a robust detection model based on You Only Look Once(YOLO)that balances accuracy and efficiency.The model mainly contains several major innovation:feature selection pyramid network,Inner-Shape Intersection over Union(ISIoU)loss function and small object detection head.To overcome the limitations of traditional fusion methods in handling multi-level features,we introduce a Feature Selection Pyramid Network integrated into the Neck component,which preserves shallow feature details critical for detecting small objects.Additionally,recognizing that deep network structures often neglect or degrade small object features,we design a specialized small object detection head in the shallow layers to enhance detection accuracy for these challenging targets.To effectively model both local and global dependencies,we introduce a Conv-Former module that simulates Transformer mechanisms using a convolutional structure,thereby improving feature enhancement.Furthermore,we employ ISIoU to address object imbalance and scale variation This approach accelerates model conver-gence and improves regression accuracy.Experimental results show that,compared to the baseline model,the proposed method significantly improves small object detection performance on the VisDrone2019 dataset,with mAP@50 increasing by 4.9%and mAP@50-95 rising by 6.7%.This model also outperforms other state-of-the-art algorithms,demonstrating its reliability and effectiveness in both small object detection and remote sensing image fusion tasks.展开更多
Visible-infrared object detection leverages the day-night stable object perception capability of infrared images to enhance detection robustness in low-light environments by fusing the complementary information of vis...Visible-infrared object detection leverages the day-night stable object perception capability of infrared images to enhance detection robustness in low-light environments by fusing the complementary information of visible and infrared images.However,the inherent differences in the imaging mechanisms of visible and infrared modalities make effective cross-modal fusion challenging.Furthermore,constrained by the physical characteristics of sensors and thermal diffusion effects,infrared images generally suffer from blurred object contours and missing details,making it difficult to extract object features effectively.To address these issues,we propose an infrared-visible image fusion network that realizesmultimodal information fusion of infrared and visible images through a carefully designedmultiscale fusion strategy.First,we design an adaptive gray-radiance enhancement(AGRE)module to strengthen the detail representation in infrared images,improving their usability in complex lighting scenarios.Next,we introduce a channelspatial feature interaction(CSFI)module,which achieves efficient complementarity between the RGB and infrared(IR)modalities via dynamic channel switching and a spatial attention mechanism.Finally,we propose a multi-scale enhanced cross-attention fusion(MSECA)module,which optimizes the fusion ofmulti-level features through dynamic convolution and gating mechanisms and captures long-range complementary relationships of cross-modal features on a global scale,thereby enhancing the expressiveness of the fused features.Experiments on the KAIST,M3FD,and FLIR datasets demonstrate that our method delivers outstanding performance in daytime and nighttime scenarios.On the KAIST dataset,the miss rate drops to 5.99%,and further to 4.26% in night scenes.On the FLIR and M3FD datasets,it achieves AP50 scores of 79.4% and 88.9%,respectively.展开更多
Accurate detection of small objects is critically important in high-stakes applications such as military reconnaissance and emergency rescue.However,low resolution,occlusion,and background interference make small obje...Accurate detection of small objects is critically important in high-stakes applications such as military reconnaissance and emergency rescue.However,low resolution,occlusion,and background interference make small object detection a complex and demanding task.One effective approach to overcome these issues is the integration of multimodal image data to enhance detection capabilities.This paper proposes a novel small object detection method that utilizes three types of multimodal image combinations,such as Hyperspectral-Multispectral(HSMS),Hyperspectral-Synthetic Aperture Radar(HS-SAR),and HS-SAR-Digital Surface Model(HS-SAR-DSM).The detection process is done by the proposed Jaccard Deep Q-Net(JDQN),which integrates the Jaccard similarity measure with a Deep Q-Network(DQN)using regression modeling.To produce the final output,a Deep Maxout Network(DMN)is employed to fuse the detection results obtained from each modality.The effectiveness of the proposed JDQN is validated using performance metrics,such as accuracy,Mean Squared Error(MSE),precision,and Root Mean Squared Error(RMSE).Experimental results demonstrate that the proposed JDQN method outperforms existing approaches,achieving the highest accuracy of 0.907,a precision of 0.904,the lowest normalized MSE of 0.279,and a normalized RMSE of 0.528.展开更多
Currently,challenges such as small object size and occlusion lead to a lack of accuracy and robustness in small object detection.Since small objects occupy only a few pixels in an image,the extracted features are limi...Currently,challenges such as small object size and occlusion lead to a lack of accuracy and robustness in small object detection.Since small objects occupy only a few pixels in an image,the extracted features are limited,and mainstream downsampling convolution operations further exacerbate feature loss.Additionally,due to the occlusionprone nature of small objects and their higher sensitivity to localization deviations,conventional Intersection over Union(IoU)loss functions struggle to achieve stable convergence.To address these limitations,LR-Net is proposed for small object detection.Specifically,the proposed Lossless Feature Fusion(LFF)method transfers spatial features into the channel domain while leveraging a hybrid attentionmechanism to focus on critical features,mitigating feature loss caused by downsampling.Furthermore,RSIoU is proposed to enhance the convergence performance of IoU-based losses for small objects.RSIoU corrects the inherent convergence direction issues in SIoU and proposes a penalty term as a Dynamic Focusing Mechanism parameter,enabling it to dynamically emphasize the loss contribution of small object samples.Ultimately,RSIoU significantly improves the convergence performance of the loss function for small objects,particularly under occlusion scenarios.Experiments demonstrate that LR-Net achieves significant improvements across variousmetrics onmultiple datasets compared with YOLOv8n,achieving a 3.7% increase in mean Average Precision(AP)on the VisDrone2019 dataset,along with improvements of 3.3% on the AI-TOD dataset and 1.2% on the COCO dataset.展开更多
Real-time detection of surface defects on cables is crucial for ensuring the safe operation of power systems.However,existing methods struggle with small target sizes,complex backgrounds,low-quality image acquisition,...Real-time detection of surface defects on cables is crucial for ensuring the safe operation of power systems.However,existing methods struggle with small target sizes,complex backgrounds,low-quality image acquisition,and interference from contamination.To address these challenges,this paper proposes the Real-time Cable Defect Detection Network(RC2DNet),which achieves an optimal balance between detection accuracy and computational efficiency.Unlike conventional approaches,RC2DNet introduces a small object feature extraction module that enhances the semantic representation of small targets through feature pyramids,multi-level feature fusion,and an adaptive weighting mechanism.Additionally,a boundary feature enhancement module is designed,incorporating boundary-aware convolution,a novel boundary attention mechanism,and an improved loss function to significantly enhance boundary localization accuracy.Experimental results demonstrate that RC2DNet outperforms state-of-the-art methods in precision,recall,F1-score,mean Intersection over Union(mIoU),and frame rate,enabling real-time and highly accurate cable defect detection in complex backgrounds.展开更多
When detecting objects in Unmanned Aerial Vehicle(UAV)taken images,large number of objects and high proportion of small objects bring huge challenges for detection algorithms based on the You Only Look Once(YOLO)frame...When detecting objects in Unmanned Aerial Vehicle(UAV)taken images,large number of objects and high proportion of small objects bring huge challenges for detection algorithms based on the You Only Look Once(YOLO)framework,rendering them challenging to deal with tasks that demand high precision.To address these problems,this paper proposes a high-precision object detection algorithm based on YOLOv10s.Firstly,a Multi-branch Enhancement Coordinate Attention(MECA)module is proposed to enhance feature extraction capability.Secondly,a Multilayer Feature Reconstruction(MFR)mechanism is designed to fully exploit multilayer features,which can enrich object information as well as remove redundant information.Finally,an MFR Path Aggregation Network(MFR-Neck)is constructed,which integrates multi-scale features to improve the network's ability to perceive objects of var-ying sizes.The experimental results demonstrate that the proposed algorithm increases the average detection accuracy by 14.15%on the Vis Drone dataset compared to YOLOv10s,effectively enhancing object detection precision in UAV-taken images.展开更多
To overcome the obstacles of poor feature extraction and little prior information on the appearance of infrared dim small targets,we propose a multi-domain attention-guided pyramid network(MAGPNet).Specifically,we des...To overcome the obstacles of poor feature extraction and little prior information on the appearance of infrared dim small targets,we propose a multi-domain attention-guided pyramid network(MAGPNet).Specifically,we design three modules to ensure that salient features of small targets can be acquired and retained in the multi-scale feature maps.To improve the adaptability of the network for targets of different sizes,we design a kernel aggregation attention block with a receptive field attention branch and weight the feature maps under different perceptual fields with attention mechanism.Based on the research on human vision system,we further propose an adaptive local contrast measure module to enhance the local features of infrared small targets.With this parameterized component,we can implement the information aggregation of multi-scale contrast saliency maps.Finally,to fully utilize the information within spatial and channel domains in feature maps of different scales,we propose the mixed spatial-channel attention-guided fusion module to achieve high-quality fusion effects while ensuring that the small target features can be preserved at deep layers.Experiments on public datasets demonstrate that our MAGPNet can achieve a better performance over other state-of-the-art methods in terms of the intersection of union,Precision,Recall,and F-measure.In addition,we conduct detailed ablation studies to verify the effectiveness of each component in our network.展开更多
A novel dual-branch decoding fusion convolutional neural network model(DDFNet)specifically designed for real-time salient object detection(SOD)on steel surfaces is proposed.DDFNet is based on a standard encoder–decod...A novel dual-branch decoding fusion convolutional neural network model(DDFNet)specifically designed for real-time salient object detection(SOD)on steel surfaces is proposed.DDFNet is based on a standard encoder–decoder architecture.DDFNet integrates three key innovations:first,we introduce a novel,lightweight multi-scale progressive aggregation residual network that effectively suppresses background interference and refines defect details,enabling efficient salient feature extraction.Then,we propose an innovative dual-branch decoding fusion structure,comprising the refined defect representation branch and the enhanced defect representation branch,which enhance accuracy in defect region identification and feature representation.Additionally,to further improve the detection of small and complex defects,we incorporate a multi-scale attention fusion module.Experimental results on the public ESDIs-SOD dataset show that DDFNet,with only 3.69 million parameters,achieves detection performance comparable to current state-of-the-art models,demonstrating its potential for real-time industrial applications.Furthermore,our DDFNet-L variant consistently outperforms leading methods in detection performance.The code is available at https://github.com/13140W/DDFNet.展开更多
The increasing prevalence of violent incidents in public spaces has created an urgent need for intelligent surveillance systems capable of detecting dangerous objects in real time.While traditional video surveillance ...The increasing prevalence of violent incidents in public spaces has created an urgent need for intelligent surveillance systems capable of detecting dangerous objects in real time.While traditional video surveillance relies on human monitoring,this approach suffers from limitations such as fatigue and delayed response times.This study addresses these challenges by developing an automated detection system using advanced deep learning techniques to enhance public safety.Our approach leverages state-of-the-art convolutional neural networks(CNNs),specifically You Only Look Once version 4(YOLOv4)and EfficientDet,for real-time object detection.The system was trained on a comprehensive dataset of over 50,000 images,enhanced through data augmentation techniques to improve robustness across varying lighting conditions and viewing angles.Cloud-based deployment on Amazon Web Services(AWS)ensured scalability and efficient processing.Experimental evaluations demonstrated high performance,with YOLOv4 achieving 92%accuracy and processing images in 0.45 s,while EfficientDet reached 93%accuracy with a slightly longer processing time of 0.55 s per image.Field tests in high-traffic environments such as train stations and shopping malls confirmed the system’s reliability,with a false alarm rate of only 4.5%.The integration of automatic alerts enabled rapid security responses to potential threats.The proposed CNN-based system provides an effective solution for real-time detection of dangerous objects in video surveillance,significantly improving response times and public safety.While YOLOv4 proved more suitable for speed-critical applications,EfficientDet offered marginally better accuracy.Future work will focus on optimizing the system for low-light conditions and further reducing false positives.This research contributes to the advancement of AI-driven surveillance technologies,offering a scalable framework adaptable to various security scenarios.展开更多
Human object detection and recognition is essential for elderly monitoring and assisted living however,models relying solely on pose or scene context often struggle in cluttered or visually ambiguous settings.To addre...Human object detection and recognition is essential for elderly monitoring and assisted living however,models relying solely on pose or scene context often struggle in cluttered or visually ambiguous settings.To address this,we present SCENET-3D,a transformer-drivenmultimodal framework that unifies human-centric skeleton features with scene-object semantics for intelligent robotic vision through a three-stage pipeline.In the first stage,scene analysis,rich geometric and texture descriptors are extracted from RGB frames,including surface-normal histograms,angles between neighboring normals,Zernike moments,directional standard deviation,and Gabor-filter responses.In the second stage,scene-object analysis,non-human objects are segmented and represented using local feature descriptors and complementary surface-normal information.In the third stage,human-pose estimation,silhouettes are processed through an enhanced MoveNet to obtain 2D anatomical keypoints,which are fused with depth information and converted into RGB-based point clouds to construct pseudo-3D skeletons.Features from all three stages are fused and fed in a transformer encoder with multi-head attention to resolve visually similar activities.Experiments on UCLA(95.8%),ETRI-Activity3D(89.4%),andCAD-120(91.2%)demonstrate that combining pseudo-3D skeletonswith rich scene-object fusion significantly improves generalizable activity recognition,enabling safer elderly care,natural human–robot interaction,and robust context-aware robotic perception in real-world environments.展开更多
基金funded by the Hainan Province Science and Technology Special Fund under Grant ZDYF2024GXJS292.
文摘Deep learning has made significant progress in the field of oriented object detection for remote sensing images.However,existing methods still face challenges when dealing with difficult tasks such as multi-scale targets,complex backgrounds,and small objects in remote sensing.Maintaining model lightweight to address resource constraints in remote sensing scenarios while improving task completion for remote sensing tasks remains a research hotspot.Therefore,we propose an enhanced multi-scale feature extraction lightweight network EM-YOLO based on the YOLOv8s architecture,specifically optimized for the characteristics of large target scale variations,diverse orientations,and numerous small objects in remote sensing images.Our innovations lie in two main aspects:First,a dynamic snake convolution(DSC)is introduced into the backbone network to enhance the model’s feature extraction capability for oriented targets.Second,an innovative focusing-diffusion module is designed in the feature fusion neck to effectively integrate multi-scale feature information.Finally,we introduce Layer-Adaptive Sparsity for magnitude-based Pruning(LASP)method to perform lightweight network pruning to better complete tasks in resource-constrained scenarios.Experimental results on the lightweight platform Orin demonstrate that the proposed method significantly outperforms the original YOLOv8s model in oriented remote sensing object detection tasks,and achieves comparable or superior performance to state-of-the-art methods on three authoritative remote sensing datasets(DOTA v1.0,DOTA v1.5,and HRSC2016).
基金funded by the National Natural Science Foundation of China under Grant No.62371187the Open Program of Hunan Intelligent Rehabilitation Robot and Auxiliary Equipment Engineering Technology Research Center under Grant No.2024JS101.
文摘The ubiquity of mobile devices has driven advancements in mobile object detection.However,challenges in multi-scale object detection in open,complex environments persist due to limited computational resources.Traditional approaches like network compression,quantization,and lightweight design often sacrifice accuracy or feature representation robustness.This article introduces the Fast Multi-scale Channel Shuffling Network(FMCSNet),a novel lightweight detection model optimized for mobile devices.FMCSNet integrates a fully convolutional Multilayer Perceptron(MLP)module,offering global perception without significantly increasing parameters,effectively bridging the gap between CNNs and Vision Transformers.FMCSNet achieves a delicate balance between computation and accuracy mainly by two key modules:the ShiftMLP module,including a shift operation and an MLP module,and a Partial group Convolutional(PGConv)module,reducing computation while enhancing information exchange between channels.With a computational complexity of 1.4G FLOPs and 1.3M parameters,FMCSNet outperforms CNN-based and DWConv-based ShuffleNetv2 by 1%and 4.5%mAP on the Pascal VOC 2007 dataset,respectively.Additionally,FMCSNet achieves a mAP of 30.0(0.5:0.95 IoU threshold)with only 2.5G FLOPs and 2.0M parameters.It achieves 32 FPS on low-performance i5-series CPUs,meeting real-time detection requirements.The versatility of the PGConv module’s adaptability across scenarios further highlights FMCSNet as a promising solution for real-time mobile object detection.
文摘In recent years,with the rapid advancement of artificial intelligence,object detection algorithms have made significant strides in accuracy and computational efficiency.Notably,research and applications of Anchor-Free models have opened new avenues for real-time target detection in optical remote sensing images(ORSIs).However,in the realmof adversarial attacks,developing adversarial techniques tailored to Anchor-Freemodels remains challenging.Adversarial examples generated based on Anchor-Based models often exhibit poor transferability to these new model architectures.Furthermore,the growing diversity of Anchor-Free models poses additional hurdles to achieving robust transferability of adversarial attacks.This study presents an improved cross-conv-block feature fusion You Only Look Once(YOLO)architecture,meticulously engineered to facilitate the extraction ofmore comprehensive semantic features during the backpropagation process.To address the asymmetry between densely distributed objects in ORSIs and the corresponding detector outputs,a novel dense bounding box attack strategy is proposed.This approach leverages dense target bounding boxes loss in the calculation of adversarial loss functions.Furthermore,by integrating translation-invariant(TI)and momentum-iteration(MI)adversarial methodologies,the proposed framework significantly improves the transferability of adversarial attacks.Experimental results demonstrate that our method achieves superior adversarial attack performance,with adversarial transferability rates(ATR)of 67.53%on the NWPU VHR-10 dataset and 90.71%on the HRSC2016 dataset.Compared to ensemble adversarial attack and cascaded adversarial attack approaches,our method generates adversarial examples in an average of 0.64 s,representing an approximately 14.5%improvement in efficiency under equivalent conditions.
文摘The goal of the present work is to demonstrate the potential of Artificial Neural Network(ANN)-driven Genetic Algorithm(GA)methods for energy efficiency and economic performance optimization of energy efficiency measures in a multi-family house building in Greece.The energy efficiency measures include different heating/cooling systems(such as low-temperature and high-temperature heat pumps,natural gas boilers,split units),building envelope components for floor,walls,roof and windows of variable heat transfer coefficients,the installation of solar thermal collectors and PVs.The calculations of the building loads and investment and operating and maintenance costs of the measures are based on the methodology defined in Directive 2010/31/EU,while economic assumptions are based on EN 15459-1 standard.Typically,multi-objective optimization of energy efficiency measures often requires the simulation of very large numbers of cases involving numerous possible combinations,resulting in intense computational load.The results of the study indicate that ANN-driven GA methods can be used as an alternative,valuable tool for reliably predicting the optimal measures which minimize primary energy consumption and life cycle cost of the building with greatly reduced computational requirements.Through GA methods,the computational time needed for obtaining the optimal solutions is reduced by 96.4%-96.8%.
文摘棉花生长过程中受到害虫严重危害,因此精准的害虫检测已成为智慧农业体系中的关键环节。其中大量棉田害虫属于小目标,特征提取困难,而且害虫个体之间存在显著的尺寸差异,这限制了现有目标检测算法的性能。提出了一种结合离散小波变换与Transformer的YOLO11目标检测算法——WTNet-YOLO(wavelet and Transformer network-YOLO)。融合部分卷积与多尺度深度卷积构建C3K2-MKPF模块,增强对多尺寸目标的特征提取能力。在颈部结合小波域融合模块(wavelet domain fusion module,WDFM)和跨阶段部分局部和全局模块(cross stage partial local and global block,CSP-LGB),提升各尺寸害虫的频域信息表达与全局信息定位。引入多尺度自适应空间注意门(multi-scale adaptive spatial attention gate,MASAG),动态融合主干与颈部的跨层特征,强化空间与语义信息表达。为验证相关方法,构建了一个棉田害虫数据集YST-PestCotton(yellow sticky trap pest dataset in cotton),涵盖多个尺寸范围的害虫,具有显著的尺度多样性,害虫像素面积最大可相差1200多倍。实验表明,在YST-PestCotton上mAP50提升了3.1个百分点,同时将害虫按目标框面积划分为0~256、256~512、512~1024和大于1024四个子集,mAP50分别提升2.4、1.3、1.5、3个百分点。在公开数据集Yellow sticky traps上mAP50达到了最高的95.3%。综合来看,WTNet-YOLO能够有效应对小目标内部的尺寸差异,同时兼顾不同尺寸害虫的检测需求。
基金supported by the National Natural Science Foundation of China(Nos.62276204 and 62203343)the Fundamental Research Funds for the Central Universities(No.YJSJ24011)+1 种基金the Natural Science Basic Research Program of Shanxi,China(Nos.2022JM-340 and 2023-JC-QN-0710)the China Postdoctoral Science Foundation(Nos.2020T130494 and 2018M633470).
文摘Drone-based small object detection is of great significance in practical applications such as military actions, disaster rescue, transportation, etc. However, the severe scale differences in objects captured by drones and lack of detail information for small-scale objects make drone-based small object detection a formidable challenge. To address these issues, we first develop a mathematical model to explore how changing receptive fields impacts the polynomial fitting results. Subsequently, based on the obtained conclusions, we propose a simple but effective Hybrid Receptive Field Network (HRFNet), whose modules include Hybrid Feature Augmentation (HFA), Hybrid Feature Pyramid (HFP) and Dual Scale Head (DSH). Specifically, HFA employs parallel dilated convolution kernels of different sizes to extend shallow features with different receptive fields, committed to improving the multi-scale adaptability of the network;HFP enhances the perception of small objects by capturing contextual information across layers, while DSH reconstructs the original prediction head utilizing a set of high-resolution features and ultrahigh-resolution features. In addition, in order to train HRFNet, the corresponding dual-scale loss function is designed. Finally, comprehensive evaluation results on public benchmarks such as VisDrone-DET and TinyPerson demonstrate the robustness of the proposed method. Most impressively, the proposed HRFNet achieves a mAP of 51.0 on VisDrone-DET with 29.3 M parameters, which outperforms the extant state-of-the-art detectors. HRFNet also performs excellently in complex scenarios captured by drones, achieving the best performance on the CS-Drone dataset we built.
基金supported by the National Natural Science Foundation of China(No.62276204)the Fundamental Research Funds for the Central Universities,China(No.YJSJ24011)+1 种基金the Natural Science Basic Research Program of Shaanxi,China(Nos.2022JM-340 and 2023-JC-QN-0710)the China Postdoctoral Science Foundation(Nos.2020T130494 and 2018M633470)。
文摘Visible and infrared(RGB-IR)fusion object detection plays an important role in security,disaster relief,etc.In recent years,deep-learning-based RGB-IR fusion detection methods have been developing rapidly,but still struggle to deal with the complex and changing scenarios captured by drones,mainly due to two reasons:(A)RGB-IR fusion detectors are susceptible to inferior inputs that degrade performance and stability.(B)RGB-IR fusion detectors are susceptible to redundant features that reduce accuracy and efficiency.In this paper,an innovative RGB-IR fusion detection framework based on global-local feature optimization,named GLFDet,is proposed to improve the detection performance and efficiency of drone-captured objects.The key components of GLFDet include a Global Feature Optimization(GFO)module,a Local Feature Optimization(LFO)module and a Channel Separation Fusion(CSF)module.Specifically,GFO calculates the information content of the input image from the frequency domain and optimizes the features holistically.Then,LFO dynamically selects high-value features and filters out low-value features before fusion,which significantly improves the efficiency of fusion.Finally,CSF fuses the RGB and IR features across the corresponding channels,which avoids the rearrangement of the channel relationships and enhances the model stability.Extensive experimental results show that the proposed method achieves the best performance on three popular RGB-IR datasets Drone Vehicle,VEDAI,and LLVIP.In addition,GLFDet is more lightweight than other comparable models,making it more appealing to edge devices such as drones.The code is available at https://github.com/lao chen330/GLFDet.
基金supported by the National Natural Science Foundation of China(No.62103298)。
文摘Aiming at the problems of low detection accuracy and large model size of existing object detection algorithms applied to complex road scenes,an improved you only look once version 8(YOLOv8)object detection algorithm for infrared images,F-YOLOv8,is proposed.First,a spatial-to-depth network replaces the traditional backbone network's strided convolution or pooling layer.At the same time,it combines with the channel attention mechanism so that the neural network focuses on the channels with large weight values to better extract low-resolution image feature information;then an improved feature pyramid network of lightweight bidirectional feature pyramid network(L-BiFPN)is proposed,which can efficiently fuse features of different scales.In addition,a loss function of insertion of union based on the minimum point distance(MPDIoU)is introduced for bounding box regression,which obtains faster convergence speed and more accurate regression results.Experimental results on the FLIR dataset show that the improved algorithm can accurately detect infrared road targets in real time with 3%and 2.2%enhancement in mean average precision at 50%IoU(mAP50)and mean average precision at 50%—95%IoU(mAP50-95),respectively,and 38.1%,37.3%and 16.9%reduction in the number of model parameters,the model weight,and floating-point operations per second(FLOPs),respectively.To further demonstrate the detection capability of the improved algorithm,it is tested on the public dataset PASCAL VOC,and the results show that F-YOLO has excellent generalized detection performance.
文摘Object detection plays a critical role in drone imagery analysis,especially in remote sensing applications where accurate and efficient detection of small objects is essential.Despite significant advancements in drone imagery detection,most models still struggle with small object detection due to challenges such as object size,complex backgrounds.To address these issues,we propose a robust detection model based on You Only Look Once(YOLO)that balances accuracy and efficiency.The model mainly contains several major innovation:feature selection pyramid network,Inner-Shape Intersection over Union(ISIoU)loss function and small object detection head.To overcome the limitations of traditional fusion methods in handling multi-level features,we introduce a Feature Selection Pyramid Network integrated into the Neck component,which preserves shallow feature details critical for detecting small objects.Additionally,recognizing that deep network structures often neglect or degrade small object features,we design a specialized small object detection head in the shallow layers to enhance detection accuracy for these challenging targets.To effectively model both local and global dependencies,we introduce a Conv-Former module that simulates Transformer mechanisms using a convolutional structure,thereby improving feature enhancement.Furthermore,we employ ISIoU to address object imbalance and scale variation This approach accelerates model conver-gence and improves regression accuracy.Experimental results show that,compared to the baseline model,the proposed method significantly improves small object detection performance on the VisDrone2019 dataset,with mAP@50 increasing by 4.9%and mAP@50-95 rising by 6.7%.This model also outperforms other state-of-the-art algorithms,demonstrating its reliability and effectiveness in both small object detection and remote sensing image fusion tasks.
基金supported by the National Natural Science Foundation of China(Grant No.62302086)the Natural Science Foundation of Liaoning Province(Grant No.2023-MSBA-070)the Fundamental Research Funds for the Central Universities(Grant No.N2317005).
文摘Visible-infrared object detection leverages the day-night stable object perception capability of infrared images to enhance detection robustness in low-light environments by fusing the complementary information of visible and infrared images.However,the inherent differences in the imaging mechanisms of visible and infrared modalities make effective cross-modal fusion challenging.Furthermore,constrained by the physical characteristics of sensors and thermal diffusion effects,infrared images generally suffer from blurred object contours and missing details,making it difficult to extract object features effectively.To address these issues,we propose an infrared-visible image fusion network that realizesmultimodal information fusion of infrared and visible images through a carefully designedmultiscale fusion strategy.First,we design an adaptive gray-radiance enhancement(AGRE)module to strengthen the detail representation in infrared images,improving their usability in complex lighting scenarios.Next,we introduce a channelspatial feature interaction(CSFI)module,which achieves efficient complementarity between the RGB and infrared(IR)modalities via dynamic channel switching and a spatial attention mechanism.Finally,we propose a multi-scale enhanced cross-attention fusion(MSECA)module,which optimizes the fusion ofmulti-level features through dynamic convolution and gating mechanisms and captures long-range complementary relationships of cross-modal features on a global scale,thereby enhancing the expressiveness of the fused features.Experiments on the KAIST,M3FD,and FLIR datasets demonstrate that our method delivers outstanding performance in daytime and nighttime scenarios.On the KAIST dataset,the miss rate drops to 5.99%,and further to 4.26% in night scenes.On the FLIR and M3FD datasets,it achieves AP50 scores of 79.4% and 88.9%,respectively.
文摘Accurate detection of small objects is critically important in high-stakes applications such as military reconnaissance and emergency rescue.However,low resolution,occlusion,and background interference make small object detection a complex and demanding task.One effective approach to overcome these issues is the integration of multimodal image data to enhance detection capabilities.This paper proposes a novel small object detection method that utilizes three types of multimodal image combinations,such as Hyperspectral-Multispectral(HSMS),Hyperspectral-Synthetic Aperture Radar(HS-SAR),and HS-SAR-Digital Surface Model(HS-SAR-DSM).The detection process is done by the proposed Jaccard Deep Q-Net(JDQN),which integrates the Jaccard similarity measure with a Deep Q-Network(DQN)using regression modeling.To produce the final output,a Deep Maxout Network(DMN)is employed to fuse the detection results obtained from each modality.The effectiveness of the proposed JDQN is validated using performance metrics,such as accuracy,Mean Squared Error(MSE),precision,and Root Mean Squared Error(RMSE).Experimental results demonstrate that the proposed JDQN method outperforms existing approaches,achieving the highest accuracy of 0.907,a precision of 0.904,the lowest normalized MSE of 0.279,and a normalized RMSE of 0.528.
基金supported by Chongqing Municipal Commission of Housing and Urban-Rural Development(Grant No.CKZ2024-87)China Chongqing Municipal Science and Technology Bureau(Grant No.2024TIAD-CYKJCXX0121).
文摘Currently,challenges such as small object size and occlusion lead to a lack of accuracy and robustness in small object detection.Since small objects occupy only a few pixels in an image,the extracted features are limited,and mainstream downsampling convolution operations further exacerbate feature loss.Additionally,due to the occlusionprone nature of small objects and their higher sensitivity to localization deviations,conventional Intersection over Union(IoU)loss functions struggle to achieve stable convergence.To address these limitations,LR-Net is proposed for small object detection.Specifically,the proposed Lossless Feature Fusion(LFF)method transfers spatial features into the channel domain while leveraging a hybrid attentionmechanism to focus on critical features,mitigating feature loss caused by downsampling.Furthermore,RSIoU is proposed to enhance the convergence performance of IoU-based losses for small objects.RSIoU corrects the inherent convergence direction issues in SIoU and proposes a penalty term as a Dynamic Focusing Mechanism parameter,enabling it to dynamically emphasize the loss contribution of small object samples.Ultimately,RSIoU significantly improves the convergence performance of the loss function for small objects,particularly under occlusion scenarios.Experiments demonstrate that LR-Net achieves significant improvements across variousmetrics onmultiple datasets compared with YOLOv8n,achieving a 3.7% increase in mean Average Precision(AP)on the VisDrone2019 dataset,along with improvements of 3.3% on the AI-TOD dataset and 1.2% on the COCO dataset.
基金supported by the National Natural Science Foundation of China under Grant 62306128the Basic Science Research Project of Jiangsu Provincial Department of Education under Grant 23KJD520003the Leading Innovation Project of Changzhou Science and Technology Bureau under Grant CQ20230072.
文摘Real-time detection of surface defects on cables is crucial for ensuring the safe operation of power systems.However,existing methods struggle with small target sizes,complex backgrounds,low-quality image acquisition,and interference from contamination.To address these challenges,this paper proposes the Real-time Cable Defect Detection Network(RC2DNet),which achieves an optimal balance between detection accuracy and computational efficiency.Unlike conventional approaches,RC2DNet introduces a small object feature extraction module that enhances the semantic representation of small targets through feature pyramids,multi-level feature fusion,and an adaptive weighting mechanism.Additionally,a boundary feature enhancement module is designed,incorporating boundary-aware convolution,a novel boundary attention mechanism,and an improved loss function to significantly enhance boundary localization accuracy.Experimental results demonstrate that RC2DNet outperforms state-of-the-art methods in precision,recall,F1-score,mean Intersection over Union(mIoU),and frame rate,enabling real-time and highly accurate cable defect detection in complex backgrounds.
基金co-supported by the National Natural Science Foundation of China(No.62103190)the Natural Science Foundation of Jiangsu Province,China(No.BK20230923)。
文摘When detecting objects in Unmanned Aerial Vehicle(UAV)taken images,large number of objects and high proportion of small objects bring huge challenges for detection algorithms based on the You Only Look Once(YOLO)framework,rendering them challenging to deal with tasks that demand high precision.To address these problems,this paper proposes a high-precision object detection algorithm based on YOLOv10s.Firstly,a Multi-branch Enhancement Coordinate Attention(MECA)module is proposed to enhance feature extraction capability.Secondly,a Multilayer Feature Reconstruction(MFR)mechanism is designed to fully exploit multilayer features,which can enrich object information as well as remove redundant information.Finally,an MFR Path Aggregation Network(MFR-Neck)is constructed,which integrates multi-scale features to improve the network's ability to perceive objects of var-ying sizes.The experimental results demonstrate that the proposed algorithm increases the average detection accuracy by 14.15%on the Vis Drone dataset compared to YOLOv10s,effectively enhancing object detection precision in UAV-taken images.
基金the Industry-University-Research Cooperation Fund Project of the Eighth Research Institute of China Aerospace Science and Technology Corporation(No.USCAST2021-5)。
文摘To overcome the obstacles of poor feature extraction and little prior information on the appearance of infrared dim small targets,we propose a multi-domain attention-guided pyramid network(MAGPNet).Specifically,we design three modules to ensure that salient features of small targets can be acquired and retained in the multi-scale feature maps.To improve the adaptability of the network for targets of different sizes,we design a kernel aggregation attention block with a receptive field attention branch and weight the feature maps under different perceptual fields with attention mechanism.Based on the research on human vision system,we further propose an adaptive local contrast measure module to enhance the local features of infrared small targets.With this parameterized component,we can implement the information aggregation of multi-scale contrast saliency maps.Finally,to fully utilize the information within spatial and channel domains in feature maps of different scales,we propose the mixed spatial-channel attention-guided fusion module to achieve high-quality fusion effects while ensuring that the small target features can be preserved at deep layers.Experiments on public datasets demonstrate that our MAGPNet can achieve a better performance over other state-of-the-art methods in terms of the intersection of union,Precision,Recall,and F-measure.In addition,we conduct detailed ablation studies to verify the effectiveness of each component in our network.
基金supported in part by the National Key R&D Program of China(Grant No.2023YFB3307604)the Shanxi Province Basic Research Program Youth Science Research Project(Grant Nos.202303021212054 and 202303021212046)+3 种基金the Key Projects Supported by Hebei Natural Science Foundation(Grant No.E2024203125)the National Science Foundation of China(Grant No.52105391)the Hebei Provincial Science and Technology Major Project(Grant No.23280101Z)the National Key Laboratory of Metal Forming Technology and Heavy Equipment Open Fund(Grant No.S2308100.W17).
文摘A novel dual-branch decoding fusion convolutional neural network model(DDFNet)specifically designed for real-time salient object detection(SOD)on steel surfaces is proposed.DDFNet is based on a standard encoder–decoder architecture.DDFNet integrates three key innovations:first,we introduce a novel,lightweight multi-scale progressive aggregation residual network that effectively suppresses background interference and refines defect details,enabling efficient salient feature extraction.Then,we propose an innovative dual-branch decoding fusion structure,comprising the refined defect representation branch and the enhanced defect representation branch,which enhance accuracy in defect region identification and feature representation.Additionally,to further improve the detection of small and complex defects,we incorporate a multi-scale attention fusion module.Experimental results on the public ESDIs-SOD dataset show that DDFNet,with only 3.69 million parameters,achieves detection performance comparable to current state-of-the-art models,demonstrating its potential for real-time industrial applications.Furthermore,our DDFNet-L variant consistently outperforms leading methods in detection performance.The code is available at https://github.com/13140W/DDFNet.
文摘The increasing prevalence of violent incidents in public spaces has created an urgent need for intelligent surveillance systems capable of detecting dangerous objects in real time.While traditional video surveillance relies on human monitoring,this approach suffers from limitations such as fatigue and delayed response times.This study addresses these challenges by developing an automated detection system using advanced deep learning techniques to enhance public safety.Our approach leverages state-of-the-art convolutional neural networks(CNNs),specifically You Only Look Once version 4(YOLOv4)and EfficientDet,for real-time object detection.The system was trained on a comprehensive dataset of over 50,000 images,enhanced through data augmentation techniques to improve robustness across varying lighting conditions and viewing angles.Cloud-based deployment on Amazon Web Services(AWS)ensured scalability and efficient processing.Experimental evaluations demonstrated high performance,with YOLOv4 achieving 92%accuracy and processing images in 0.45 s,while EfficientDet reached 93%accuracy with a slightly longer processing time of 0.55 s per image.Field tests in high-traffic environments such as train stations and shopping malls confirmed the system’s reliability,with a false alarm rate of only 4.5%.The integration of automatic alerts enabled rapid security responses to potential threats.The proposed CNN-based system provides an effective solution for real-time detection of dangerous objects in video surveillance,significantly improving response times and public safety.While YOLOv4 proved more suitable for speed-critical applications,EfficientDet offered marginally better accuracy.Future work will focus on optimizing the system for low-light conditions and further reducing false positives.This research contributes to the advancement of AI-driven surveillance technologies,offering a scalable framework adaptable to various security scenarios.
基金funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2025R410),Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.
文摘Human object detection and recognition is essential for elderly monitoring and assisted living however,models relying solely on pose or scene context often struggle in cluttered or visually ambiguous settings.To address this,we present SCENET-3D,a transformer-drivenmultimodal framework that unifies human-centric skeleton features with scene-object semantics for intelligent robotic vision through a three-stage pipeline.In the first stage,scene analysis,rich geometric and texture descriptors are extracted from RGB frames,including surface-normal histograms,angles between neighboring normals,Zernike moments,directional standard deviation,and Gabor-filter responses.In the second stage,scene-object analysis,non-human objects are segmented and represented using local feature descriptors and complementary surface-normal information.In the third stage,human-pose estimation,silhouettes are processed through an enhanced MoveNet to obtain 2D anatomical keypoints,which are fused with depth information and converted into RGB-based point clouds to construct pseudo-3D skeletons.Features from all three stages are fused and fed in a transformer encoder with multi-head attention to resolve visually similar activities.Experiments on UCLA(95.8%),ETRI-Activity3D(89.4%),andCAD-120(91.2%)demonstrate that combining pseudo-3D skeletonswith rich scene-object fusion significantly improves generalizable activity recognition,enabling safer elderly care,natural human–robot interaction,and robust context-aware robotic perception in real-world environments.