The interpretation of geological structures on Earth observation images relies, like many other domains, on both visual observation and specialized knowledge. To assist this process and make it more objective, we propose a method for extracting the components of complex shapes that carry geological significance. Remote sensing produces digital recordings of the brightness of objects on the ground; these recordings are usually presented as images, ready for automatic computer processing. The numerical techniques used exploit the transformation properties of mathematical morphology. The presentation shows sequences of operations with tailored properties. The example shown is a portion of an anticline in which the organization clearly exhibits oriented entities. The results are obtained by a procedure of direct interest for geological reasoning: the extraction of the entities involved in the observed structure, and the exploration of the main direction of the set of objects striking the structure. Elementary entities are extracted by recognizing their physical and physiognomic characteristics, such as reflectance, shadow effects, size, shape, or orientation. The resulting image must then frequently be stripped of many artifacts. Another sequence was developed to minimize the noise caused by the direct identification of the physical measures contained in the image. Data from the different spectral bands are first filtered by a grayscale morphology operator to remove high-frequency spatial components. The image obtained for the subsequent processing is therefore more compact and closer to the geologist's needs.
The search for a significant overall direction is based on interception measures sampled over a rotation from 0 to 180 degrees. The results show a clear geological significance in the organization of the extracted objects.
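The direction search described above (grayscale opening followed by sampling orientations from 0 to 180 degrees) can be illustrated with a minimal sketch, not the authors' implementation: a 1-D grayscale opening suppresses bright structures thinner than the line element, so comparing opened responses along different orientations exposes the dominant direction. For brevity only the two axis-aligned orientations are shown; a full 0–180° scan would rotate the structuring element.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def erode_1d(img, length, axis):
    # grayscale erosion: minimum over a 1-D sliding window along one axis
    pad = [(0, 0), (0, 0)]
    pad[axis] = (length // 2, length // 2)
    padded = np.pad(img, pad, mode="edge")
    return sliding_window_view(padded, length, axis=axis).min(axis=-1)

def line_opening(img, length, axis):
    # opening = erosion then dilation; dilation(y) = -erosion(-y)
    return -erode_1d(-erode_1d(img, length, axis), length, axis)
```

A horizontal bright lineament survives the horizontal opening but is erased by the vertical one, so the per-orientation sum of the opened image acts as the "interception measure" for that angle.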
Behavior recognition of Hu sheep contributes to their intensive and intelligent farming. Because Hu sheep are generally farmed at high density, severe occlusion occurs among different behaviors and even among sheep performing the same behavior, leading to missed and false detections in existing behavior recognition methods. A high-low frequency aggregated attention and negative sample comprehensive score loss and comprehensive score soft non-maximum suppression YOLO (HLNC-YOLO) is proposed for identifying the behavior of Hu sheep, addressing the missed and erroneous detections caused by occlusion between Hu sheep in intensive farming. First, images of four typical behaviors (standing, lying, eating, and drinking) were collected from a sheep farm to construct the Hu sheep behavior dataset (HSBD). Next, to address occlusion during training, the C2F-HLAtt module, which combines high-low frequency aggregation attention, was integrated into the YOLO v8 backbone to perceive occluded objects, and an auxiliary reversible branch was introduced to retain more effective features. A comprehensive score regression loss (CSLoss) was used to reduce the scores of suboptimal boxes and raise the comprehensive scores of occluded-object boxes. Finally, the soft comprehensive score non-maximum suppression (Soft-CS-NMS) algorithm filters prediction boxes during inference. On the HSBD test set, HLNC-YOLO achieved a mean average precision (mAP@50) of 87.8% with a memory footprint of 17.4 MB, an improvement of 7.1, 2.2, 4.6, and 11 percentage points over YOLO v8, YOLO v9, YOLO v10, and Faster R-CNN, respectively. The results indicate that HLNC-YOLO accurately identifies the behavior of Hu sheep in intensive farming and generalizes well, providing technical support for smart farming.
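The Soft-CS-NMS step builds on the classic Gaussian Soft-NMS idea: instead of discarding boxes that overlap a higher-scoring box, their scores are decayed, which preserves heavily occluded objects. A plain Gaussian Soft-NMS sketch (without the paper's comprehensive-score extension, which is not specified in the abstract) looks like:

```python
import numpy as np

def iou_one_to_many(box, boxes):
    # boxes are [x1, y1, x2, y2]
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a = (box[2] - box[0]) * (box[3] - box[1])
    b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (a + b - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=1e-3):
    # Gaussian Soft-NMS: overlapping boxes are down-weighted, not discarded
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    idx = np.arange(len(scores))
    keep = []
    while idx.size:
        best = idx[np.argmax(scores[idx])]
        keep.append(int(best))
        idx = idx[idx != best]
        if idx.size:
            decay = np.exp(-iou_one_to_many(boxes[best], boxes[idx]) ** 2 / sigma)
            scores[idx] *= decay
            idx = idx[scores[idx] > score_thresh]
    return keep, scores
```

Two fully overlapping boxes both survive, the weaker one with a decayed score, which is exactly the behavior needed for occluded sheep.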
Human Activity Recognition (HAR) is a novel area of computer vision. It has a great impact on healthcare, smart environments, and surveillance, as it can automatically detect human behavior. It plays a vital role in many applications, such as smart homes, healthcare, human-computer interaction, sports analysis, and especially intelligent surveillance. In this paper, we propose a robust and efficient HAR system by leveraging deep learning paradigms, including pre-trained models, CNN architectures, and their average-weighted fusion. Due to the diversity of human actions, various environmental influences, and a lack of data and resources, high recognition accuracy remains elusive. In this work, a weighted average ensemble technique is employed to fuse three deep learning models: EfficientNet, ResNet50, and a custom CNN. The results of this study indicate that a weighted average ensemble strategy is a promising approach for developing more effective HAR models for the detection and classification of human activities. Experiments on a benchmark dataset showed that the proposed weighted ensemble approach outperformed existing approaches in terms of accuracy and other key performance measures. The combined average-weighted ensemble of pre-trained and CNN models obtained an accuracy of 98%, compared to 97%, 96%, and 95% for the customized CNN, EfficientNet, and ResNet50 models, respectively.
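The weighted average ensemble described above amounts to a convex combination of the per-model softmax outputs. A minimal sketch (the weights below are placeholders; the paper's 98% figure comes from weights tuned on the actual three models):

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    # prob_list: list of (n_samples, n_classes) softmax outputs, one per model
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize so the weights sum to 1
    fused = sum(wi * p for wi, p in zip(w, prob_list))
    return fused.argmax(axis=1), fused
```

Because the combination is linear, a model that is confident but often wrong can simply be assigned a small weight rather than removed.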
The noise present in depth images obtained with RGB-D sensors is a combination of hardware limitations and environmental factors; the limited capabilities of the sensors also degrade computer vision results. Common image denoising techniques, being based on spatial and frequency filtering, tend to remove significant image details along with the noise. The framework presented in this paper is a novel denoising model that uses Boruta-driven feature selection with a Long Short-Term Memory Autoencoder (LSTMAE). The Boruta algorithm identifies the most useful depth features, which are used to preserve spatial structure and reduce redundancy. An LSTMAE then processes these selected features and models depth pixel sequences to generate robust, noise-resistant representations. The encoder compresses the input into a latent space, which the decoder then expands to recover the clean image. Experiments on a benchmark dataset show that the proposed technique attains a PSNR of 45 dB and an SSIM of 0.90, 10 dB higher than conventional convolutional autoencoders and 15 times higher than the wavelet-based models. Moreover, the feature selection step decreases the input dimensionality by 40%, resulting in a 37.5% reduction in training time and a real-time inference rate of 200 FPS. The Boruta-LSTMAE framework therefore offers an efficient and scalable system for depth image denoising, with high potential for close-range 3D applications such as robotic manipulation and gesture-based interfaces.
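PSNR, the headline metric above, is derived directly from the mean squared error between the clean and denoised images. A minimal reference implementation (assuming 8-bit images with a peak value of 255):

```python
import numpy as np

def psnr(clean, denoised, max_val=255.0):
    # peak signal-to-noise ratio in decibels
    mse = np.mean((clean.astype(float) - denoised.astype(float)) ** 2)
    if mse == 0:
        return float("inf")     # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```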
Objective To develop a depression recognition model by integrating the spirit-expression diagnostic framework of traditional Chinese medicine (TCM) with machine learning algorithms. The proposed model seeks to establish a TCM-informed tool for early depression screening, thereby bridging traditional diagnostic principles with modern computational approaches. Methods The study included patients with depression who visited the Shanghai Pudong New Area Mental Health Center from October 1, 2022 to October 1, 2023, as well as students and teachers from Shanghai University of Traditional Chinese Medicine during the same period as the healthy control group. Videos of 3–10 s were captured using a Xiaomi Pad 5, and the TCM spirit and expressions were determined by TCM experts (at least 3 out of 5 experts had to agree on the category). Basic information, facial images, and interview information were collected through a portable TCM intelligent analysis and diagnosis device, and facial diagnosis features were extracted using the OpenCV computer vision library. Statistical methods such as parametric and non-parametric tests were used to analyze the baseline data, TCM spirit and expression features, and facial diagnosis feature parameters of the two groups, to compare the differences in TCM spirit and expression and facial features. Five machine learning algorithms, extreme gradient boosting (XGBoost), decision tree (DT), Bernoulli naive Bayes (BernoulliNB), support vector machine (SVM), and k-nearest neighbor (KNN) classification, were used to construct a depression recognition model based on the fusion of TCM spirit and expression features. Model performance was evaluated using metrics such as accuracy, precision, and the area under the receiver operating characteristic (ROC) curve (AUC). The model results were explained using Shapley Additive exPlanations (SHAP). Results A total of 93 depression patients and 87 healthy individuals were ultimately included in this study. There was no statistically significant difference in baseline characteristics between the two groups (P > 0.05). The differences in TCM spirit and expression characteristics and facial features between the two groups were as follows. (i) Quantispirit facial analysis revealed that depression patients exhibited significantly reduced facial spirit and luminance compared with healthy controls (P < 0.05), with characteristic features such as sad expressions, facial erythema, and lip color ranging from erythematous to cyanotic. (ii) Depressed patients exhibited significantly lower values of facial complexion L, lip L and a values, and gloss index, but higher values of facial complexion a and b, lip b, low gloss index, and matte index (all P < 0.05). (iii) Across multiple models, the XGBoost-based depression recognition model integrating the TCM "spirit-expression" diagnostic framework achieved an accuracy of 98.61% and significantly outperformed the four benchmark algorithms DT, BernoulliNB, SVM, and KNN (P < 0.01). (iv) SHAP visualization shows that, in the recognition model constructed by the XGBoost algorithm, the complexion b value, categories of facial spirit, high gloss index, low gloss index, categories of facial expression, and texture features contribute significantly to the model. Conclusion This study demonstrates that integrating TCM spirit-expression diagnostic features with machine learning enables the construction of a high-precision depression detection model, offering a novel paradigm for objective depression diagnosis.
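The AUC reported above can be computed without plotting a ROC curve: it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (the normalized Mann-Whitney U statistic). A small sketch, independent of the paper's code:

```python
import numpy as np

def roc_auc(labels, scores):
    # AUC = P(score of random positive > score of random negative), ties count half
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

The pairwise form is O(n²) but fine for cohort sizes like the 180 participants here; a rank-based version scales better.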
Securing restricted zones such as airports, research facilities, and military bases requires robust and reliable access control mechanisms to prevent unauthorized entry and safeguard critical assets. Face recognition has emerged as a key biometric approach for this purpose; however, existing systems are often sensitive to variations in illumination, occlusion, and pose, which degrade their performance in real-world conditions. To address these challenges, this paper proposes a novel hybrid face recognition method that integrates complementary feature descriptors, Fuzzy-Gabor 2D Fisher Linear Discriminant (FG-2DFLD), Generalized 2D Linear Discriminant Analysis (G2DLDA), and Modular Local Binary Patterns (Modular-LBP), with Dempster-Shafer (DS) evidence theory for decision fusion. The proposed framework extracts global, structural, and local texture features, models them using Gaussian distributions to estimate belief factors, and fuses these belief factors through DS theory to explicitly handle uncertainty and conflict among descriptors. Experimental validation on two widely used benchmark datasets, ORL and Cropped Yale B, achieved recognition rates exceeding 98%, outperforming traditional methods as well as recent deep learning-based approaches. Furthermore, the method demonstrated strong robustness under noisy conditions, maintaining accuracies above 96% with salt-and-pepper and Gaussian noise. These results highlight the effectiveness of the proposed integration strategy in enhancing accuracy, reliability, and resilience compared to single-descriptor and conventional fusion methods. Given its high performance and efficiency, the proposed method shows strong potential for deployment in real-world restricted-zone applications such as smart parking systems, secure facility access, and other high-security domains.
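Dempster's rule of combination, the fusion step named above, multiplies the mass assignments of two evidence sources, accumulates mass on the intersections of their focal elements, and renormalizes away the conflicting (empty-intersection) mass. A minimal sketch over frozenset-keyed mass functions (illustrative only; it does not reproduce the paper's Gaussian belief-factor estimation):

```python
def dempster_combine(m1, m2):
    """Combine two mass functions whose keys are frozensets of hypotheses.
    Mass falling on empty intersections is treated as conflict and
    normalized away."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources fully contradict")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}
```

With three descriptors, the rule is applied pairwise (it is associative), so conflicting single-descriptor verdicts are down-weighted rather than letting one descriptor dominate.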
Deep neural networks have achieved excellent classification results on several computer vision benchmarks. This has led to the popularity of machine learning as a service, where trained algorithms are hosted on the cloud and inference can be obtained on real-world data. In most applications, it is important to compress the vision data due to the enormous bandwidth and memory requirements. Video codecs exploit spatial and temporal correlations to achieve high compression ratios, but they are computationally expensive. This work computes the motion fields between consecutive frames to facilitate the efficient classification of videos. However, contrary to the normal practice of reconstructing the full-resolution frames through motion compensation, this work proposes to infer the class label directly from the block-based computed motion fields. Motion fields are a richer and more complex representation than raw motion vectors, where each motion vector carries magnitude and direction information. This approach has two advantages: the cost of motion compensation and video decoding is avoided, and the dimensions of the input signal are greatly reduced, allowing a shallower classification network. The neural network can be trained on motion vectors in two ways: as complex representations or as magnitude-direction pairs. The proposed work trains a convolutional neural network on the direction and magnitude tensors of the motion fields. Our experimental results show 20× faster convergence during training, reduced overfitting, and accelerated inference on a hand gesture recognition dataset compared to full-resolution and downsampled frames. We validate the proposed methodology on the HGds dataset, achieving a testing accuracy of 99.21%, on the HMDB51 dataset, achieving 82.54% accuracy, and on the UCF101 dataset, achieving 97.13% accuracy, outperforming state-of-the-art methods in computational efficiency.
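Converting block motion vectors into the magnitude and direction tensors used as network input is a small preprocessing step. A sketch of one plausible encoding (the abstract does not specify the angle convention, so degrees in [0, 360) are assumed here):

```python
import numpy as np

def motion_to_mag_dir(dx, dy):
    """Build a 2-channel tensor from per-block motion components:
    channel 0 = vector magnitude, channel 1 = direction in degrees."""
    mag = np.hypot(dx, dy)
    ang = np.degrees(np.arctan2(dy, dx)) % 360.0
    return np.stack([mag, ang], axis=0)   # shape (2, H, W)
```

For a 1080p frame with 16×16 blocks this yields a 2×68×120 input, orders of magnitude smaller than the decoded frames, which is why a shallower network suffices.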
Scene recognition is a critical component of computer vision, powering applications from autonomous vehicles to surveillance systems. However, its development is often constrained by a heavy reliance on large, expensively annotated datasets. This research presents a novel, efficient approach that leverages multi-model transfer learning from pre-trained deep neural networks, specifically DenseNet201 and the Visual Geometry Group (VGG) network, to overcome this limitation. Our method significantly reduces the dependency on vast labeled data while achieving high accuracy. Evaluated on the Aerial Image Dataset (AID), the model attained a validation accuracy of 93.6% with a loss of 0.35, demonstrating robust performance with minimal training data. These results underscore the viability of our approach for real-time, data-efficient scene recognition, offering a practical and cost-effective advancement for the field.
Discriminative region localization and efficient feature encoding are crucial for fine-grained object recognition. However, existing data augmentation methods struggle to accurately locate discriminative regions under complex backgrounds, small target objects, and limited training data, leading to poor recognition. Fine-grained images exhibit small inter-class differences, and while second-order feature encoding enhances discrimination, it often requires dual Convolutional Neural Networks (CNNs), increasing training time and complexity. This study proposes a model integrating discriminative region localization and efficient second-order feature encoding. By ranking feature map channels via a fully connected layer, it selects high-importance channels to generate an enhanced map, accurately locating discriminative regions. Cropping and erasing augmentations further refine recognition. To improve efficiency, a novel second-order feature encoding module generates an attention map from the fourth convolutional group of the 50-layer Residual Network (ResNet-50) and multiplies it with the features from the fifth group, producing second-order features while reducing dimensionality and training time. Experiments on the Caltech-UCSD Birds-200-2011 (CUB-200-2011), Stanford Cars, and Fine-Grained Visual Classification of Aircraft (FGVC Aircraft) datasets show state-of-the-art accuracies of 88.9%, 94.7%, and 93.3%, respectively.
Gait recognition is a key biometric for long-distance identification, yet its performance is severely degraded by real-world challenges such as varying clothing, carrying conditions, and changing viewpoints. While combining silhouette and skeleton data is a promising direction, effectively fusing these heterogeneous modalities and adaptively weighting their contributions under diverse conditions remains a central problem. This paper introduces GaitMAFF, a novel Multi-modal Adaptive Feature Fusion Network, to address this challenge. Our approach first transforms discrete skeleton joints into a dense SkeletonMap representation to align with silhouettes, then employs an attention-based module to dynamically learn the fusion weights between the two modalities. The fused features are processed by a powerful spatio-temporal backbone with Weighted Global-Local Feature Fusion Modules (WFFM) to learn a discriminative representation. Extensive experiments on the challenging CCPG and Gait3D datasets show that GaitMAFF achieves state-of-the-art performance, with an average Rank-1 accuracy of 84.6% on CCPG and 58.7% on Gait3D. These results demonstrate that our adaptive fusion strategy effectively integrates complementary multimodal information, significantly enhancing the robustness and accuracy of gait recognition in complex scenes and providing a practical solution for real-world applications.
Person recognition in photo collections is a critical yet challenging task in computer vision. Previous studies have used social relationships within photo collections to address this issue. However, these methods often fail when recognizing a single person in a photo collection, as they cannot rely on social connections for recognition. In this work, we discard social relationships and instead measure the relationships between photos to solve this problem. We designed a new model that includes a multi-parameter attention network for adaptively fusing visual features and a unified formula for measuring photo intimacy. This model effectively recognizes individuals in single photos within a collection. Due to outdated annotations and missing photos in the existing PIPA (Person in Photo Album) dataset, we manually re-annotated it and added approximately ten thousand photos of Asian individuals to address the underrepresentation issue. Our results on the re-annotated PIPA dataset are superior to previous studies in most cases, and experiments on the supplemented dataset further demonstrate the effectiveness of our method. We have made the PIPA dataset publicly available on Zenodo, with the DOI: 10.5281/zenodo.12508096 (accessed on 15 October 2025).
Accurate and rapid recognition of weathering degree (WD) and groundwater condition (GC) is essential for evaluating rock mass quality and conducting stability analyses in underground engineering. Conventional WD and GC recognition methods often rely on subjective evaluation by field experts, supplemented by field sampling and laboratory testing. These methods are frequently complex and time-consuming, making it challenging to meet the rapidly evolving demands of underground engineering. Therefore, this study proposes a rock non-geometric parameter classification network (RNPC-net) to rapidly recognize and map the WD and GC of tunnel faces. The hybrid feature extraction module (HFEM) in RNPC-net fully extracts, fuses, and utilizes multi-scale image features, enhancing the network's classification performance. Moreover, the designed adaptive weighting auxiliary classifier (AC) helps the network learn features more efficiently. Experimental results show that RNPC-net achieved classification accuracies of 0.8756 and 0.8710 for WD and GC, respectively, an improvement of approximately 2%–10% over other methods. Both quantitative and qualitative experiments confirm the effectiveness and superiority of RNPC-net. Furthermore, for WD and GC mapping, RNPC-net outperformed other methods by achieving the highest mean intersection over union (mIoU) across most tunnel faces. The mapping results closely align with measurements provided by field experts. Applying the WD and GC mapping results to the rock mass rating (RMR) system achieved a transition from conventional qualitative to quantitative evaluation. This advancement enables more accurate and reliable rock mass quality evaluations, particularly under critical RMR conditions.
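The mIoU metric used for the mapping evaluation averages, over classes, the ratio of pixels where prediction and ground truth agree on a class to the pixels where either assigns it. A compact reference implementation:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union from flat integer label arrays.
    Classes absent from both prediction and ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious))
```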
This study presents a hybrid CNN-Transformer model for real-time recognition of affective tactile biosignals. The proposed framework combines convolutional neural networks (CNNs), which extract spatial and local temporal features, with a Transformer encoder that captures long-range dependencies in time-series data through multi-head attention. Model performance was evaluated on two widely used tactile biosignal datasets, HAART and CoST, which contain diverse affective touch gestures recorded from pressure sensor arrays. The CNN-Transformer model achieved recognition rates of 93.33% on HAART and 80.89% on CoST, outperforming existing methods on both benchmarks. By incorporating temporal windowing, the model enables near-instantaneous prediction and improves generalization across gestures of varying duration. These results highlight the effectiveness of deep learning for tactile biosignal processing and demonstrate the potential of the CNN-Transformer approach for future applications in wearable sensors, affective computing, and biomedical monitoring.
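Temporal windowing, credited above with enabling instantaneous prediction, simply slices a long sensor sequence into fixed-length overlapping segments that can each be classified independently. A minimal sketch (the window and hop sizes are placeholders, not values from the paper):

```python
import numpy as np

def sliding_windows(signal, win, hop):
    """Split a (T, C) multichannel time series into overlapping
    (win, C) windows advanced by `hop` samples."""
    starts = range(0, len(signal) - win + 1, hop)
    return np.stack([signal[s:s + win] for s in starts])
```

Because each window produces its own prediction, a gesture can be recognized before it finishes, and gestures of different lengths contribute comparable numbers of training samples.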
Human activity recognition (HAR) is a method to predict human activities from sensor signals using machine learning (ML) techniques. HAR systems have applications in various domains, including medicine, surveillance, behavioral monitoring, and posture analysis. Extracting suitable information from sensor data is an important part of the HAR process for recognizing activities accurately. Several HAR studies have utilized Mel-frequency cepstral coefficients (MFCCs) because of their effectiveness in capturing the periodic patterns of sensor signals. However, existing MFCC-based approaches often fail to capture sufficient temporal variability, which limits their ability to robustly distinguish between complex or imbalanced activity classes. To address this gap, this study proposes a feature fusion strategy that merges time-based and MFCC features (MFCCT) to enhance activity representation. The merged features were fed to a convolutional neural network (CNN) integrated with long short-term memory (LSTM), DeepConvLSTM, to construct the HAR model. The MFCCT features with DeepConvLSTM achieved better performance than MFCCs and time-based features alone on PAMAP2, UCI-HAR, and WISDM, obtaining accuracies of 97%, 98%, and 97%, respectively. In addition, DeepConvLSTM outperformed the deep learning (DL) algorithms recently employed in HAR. These results confirm that the proposed hybrid features are not only practical but also generalizable, making them applicable across diverse HAR datasets for accurate activity classification.
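The MFCCT fusion concatenates time-domain statistics with spectral (MFCC) features per window. A sketch of the time-feature side and the concatenation, with a placeholder vector standing in for a real MFCC extractor (e.g., from librosa); the specific statistics below are common choices, not the paper's exact feature list:

```python
import numpy as np

def time_features(window):
    # simple per-channel time-domain statistics: mean, std,
    # and mean absolute first difference (a rough "activity" measure)
    return np.concatenate([window.mean(axis=0),
                           window.std(axis=0),
                           np.abs(np.diff(window, axis=0)).mean(axis=0)])

def fuse(time_feats, mfcc_feats):
    # MFCCT: concatenate time-domain and MFCC feature vectors
    return np.concatenate([time_feats, mfcc_feats])
```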
Industrial operators need reliable communication in high-noise, safety-critical environments where speech or touch input is often impractical. Existing gesture systems either miss real-time deadlines on resource-constrained hardware or lose accuracy under occlusion, vibration, and lighting changes. We introduce Industrial EdgeSign, a dual-path framework that combines hardware-aware neural architecture search (NAS) with large multimodal model (LMM)-guided semantics to deliver robust, low-latency gesture recognition on edge devices. The searched model uses a truncated ResNet50 front end, a dimensional-reduction network that preserves spatiotemporal structure for tubelet-based attention, and localized Transformer layers tuned for on-device inference. To reduce reliance on gloss annotations and mitigate domain shift, we distill semantics from factory-tuned vision-language models and pre-train with masked language modeling and video-text contrastive objectives, aligning visual features with a shared text space. On ML2HP and SHREC'17, the NAS-derived architecture attains 94.7% accuracy with 86 ms inference latency and about 5.9 W power on a Jetson Nano. Under occlusion, lighting shifts, and motion blur, accuracy remains above 82%. For safety-critical commands, the emergency-stop gesture achieves 72 ms 99th-percentile latency with 99.7% fail-safe triggering. Ablation studies confirm the contributions of the spatiotemporal tubelet extractor and text-side pre-training, and we observe gains in translation quality (BLEU-4 of 22.33). These results show that Industrial EdgeSign provides accurate, resource-aware, and safety-aligned gesture recognition suitable for deployment in smart factory settings.
Multimodal emotion recognition (MER) has emerged as a key research area for enabling human-centered artificial intelligence, supported by rapid progress in vision, audio, language, and physiological modeling. Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms, yet the field remains fragmented due to differences in feature alignment, temporal synchronization, modality reliability, and robustness to noise or missing inputs. This survey provides a comprehensive analysis of MER research from 2021 to 2025, consolidating advances in modality-specific representation learning, cross-modal feature construction, and early, late, and hybrid fusion paradigms. We systematically review visual, acoustic, textual, and sensor-based embeddings, highlighting how pre-trained encoders, self-supervised learning, and large language models have reshaped the representational foundations of MER. We further categorize fusion strategies by interaction depth and architectural design, examining how attention mechanisms, cross-modal transformers, adaptive gating, and multimodal large language models redefine the integration of affective signals. Finally, we summarize the major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability, generalization, and interpretability. This survey aims to provide a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.
The detection of amino acid enantiomers holds significant importance in the biomedical, chemical, food, and other fields. Traditional chiral recognition methods using fluorescent probes rely primarily on fluorescence intensity changes, which can compromise accuracy and repeatability. In this study, we report a novel fluorescent probe, (R)-Z1, that achieves effective enantioselective recognition of chiral amino acids in water by shifting its emission wavelength (>60 nm). The water-soluble probe (R)-Z1 exhibits cyan or yellow-green luminescence upon interaction with amino acid enantiomers, enabling reliable chiral detection of 14 natural amino acids. It also allows the determination of enantiomeric excess by monitoring changes in luminescent color. Additionally, a logic operation with two inputs and three outputs was constructed based on these optical properties. Notably, amino acid enantiomers were successfully detected via dual-channel analysis at both the food and cellular levels. This study provides a new dynamic luminescence-based tool for the accurate sensing and detection of amino acid enantiomers.
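Enantiomeric excess, which the probe's color response is used to determine, is defined as the difference between the two enantiomer amounts relative to their sum, expressed as a percentage:

```python
def enantiomeric_excess(major, minor):
    """ee (%) = 100 * (major - minor) / (major + minor),
    where major and minor are the amounts (or mole fractions)
    of the two enantiomers."""
    return 100.0 * (major - minor) / (major + minor)
```

A racemic mixture gives 0% ee and an enantiopure sample gives 100% ee; in practice the calibration curve maps the probe's emission ratio to this value.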
Discontinuities in rock masses critically impact the stability and safety of underground engineering. Mainstream discontinuity identification methods, which rely on normal vector estimation and clustering algorithms, suffer from accuracy degradation and omission of critical discontinuities when the orientation density is unevenly distributed, and they require manual intervention. To overcome these limitations, this paper introduces a novel discontinuity identification method based on geometric feature analysis of the rock mass. By analyzing the spatial distribution variability of the point cloud and integrating an adaptive region growing algorithm, the method accurately detects independent discontinuities under complex geological conditions. Given that rock mass orientations typically follow a Fisher distribution, an adaptive hierarchical clustering algorithm based on statistical analysis is employed to automatically determine the optimal number of structural sets, eliminating the preset cluster counts or thresholds required by traditional methods. The proposed approach effectively handles diverse rock mass shapes and sizes, leveraging both local and global geometric features to minimize noise interference. Experimental validation on three real-world rock mass models, alongside comparisons with three conventional directional clustering algorithms, demonstrates superior accuracy and robustness in identifying optimal discontinuity sets. The proposed method offers a reliable and efficient tool for discontinuity detection and grouping in underground engineering, significantly enhancing design and construction outcomes.
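Under the Fisher distribution assumption above, a structural set of unit normal vectors has a mean direction given by the normalized resultant vector, and its concentration parameter is commonly estimated as kappa ≈ (N − 1)/(N − R), where R is the resultant length. A minimal sketch (standard directional-statistics estimate, not the paper's clustering code):

```python
import numpy as np

def fisher_stats(normals):
    """Mean direction and Fisher concentration estimate for a set of
    (approximately) unit normal vectors of one structural set."""
    n = np.asarray(normals, dtype=float)
    n /= np.linalg.norm(n, axis=1, keepdims=True)   # ensure unit length
    resultant = n.sum(axis=0)
    R = np.linalg.norm(resultant)                   # resultant length
    N = len(n)
    kappa = (N - 1) / (N - R)                       # standard estimator
    return resultant / R, kappa
```

A tight set yields R close to N and hence a large kappa, while scattered normals give a small kappa, which is what lets a statistical test decide how many sets the orientations support.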
Audio-visual speech recognition (AVSR), which integrates audio and visual modalities to improve recognition performance and robustness in noisy or adverse acoustic conditions, has attracted significant research interest. However, Conformer-based architectures remain computationally expensive due to the quadratic increase in the spatial and temporal complexity of their softmax-based attention mechanisms with sequence length. In addition, Conformer-based architectures may not provide sufficient flexibility for modeling local dependencies at different granularities. To mitigate these limitations, this study introduces a novel AVSR framework based on a ReLU-based Sparse and Grouped Conformer (RSG-Conformer) architecture. Specifically, we propose a Global-enhanced Sparse Attention (GSA) module incorporating an efficient context restoration block to recover lost contextual cues. Concurrently, a Grouped-scale Convolution (GSC) module replaces the standard Conformer convolution module, providing adaptive local modeling across varying temporal resolutions. Furthermore, we integrate a Refined Intermediate Contextual CTC (RIC-CTC) supervision strategy. This approach applies progressively increasing loss weights combined with convolution-based context aggregation, thereby further relaxing the constraint of conditional independence inherent in standard CTC frameworks. Evaluations on the LRS2 and LRS3 benchmarks validate the efficacy of our approach, with word error rates (WERs) reduced to 1.8% and 1.5%, respectively. These results confirm its state-of-the-art performance in AVSR tasks.
Accurately recognizing driver distraction is critical for preventing traffic accidents, yet current detection models face two persistent challenges. First, distractions are often fine-grained, involving subtle cues such as brief eye closures or partial yawns, which are easily missed by conventional detectors. Second, in real-world scenarios, drivers frequently exhibit overlapping behaviors, such as simultaneously holding a cup, closing their eyes, and yawning, leading to multiple detection boxes and degraded model performance. Existing approaches fail to robustly address these complexities, resulting in limited reliability in safety-critical applications. To overcome these pain points, we propose YOLO-Drive, a novel framework that enhances YOLO-based driver monitoring with EfficientViM and Polarized Spectral-Spatial Attention (PSSA) modules. EfficientViM provides lightweight yet powerful global-local feature extraction, enabling accurate recognition of subtle driver states. PSSA further amplifies discriminative features across spatial and spectral domains, ensuring robust separation of concurrent distraction cues. By explicitly modeling fine-grained and overlapping behaviors, our approach delivers significant improvements in both precision and robustness. Extensive experiments on benchmark driver distraction datasets demonstrate that YOLO-Drive consistently outperforms state-of-the-art models, achieving higher detection accuracy while maintaining real-time efficiency. These results validate YOLO-Drive as a practical and reliable solution for advanced driver monitoring systems, addressing long-standing challenges of subtle cue recognition and multi-cue distraction detection.
Abstract: The interpretation of geological structures on Earth observation images involves, as in many other domains, both visual observation and specialized knowledge. To support this process and make it more objective, we propose a method to extract the components of complex shapes with geological significance. Remote sensing produces digital recordings of the brightness of objects on the ground. These recordings are usually presented as images and are ready for automatic computer processing. The numerical techniques used exploit the properties of mathematical morphology transformations. The presentation shows sequences of operations with tailored properties. The example shown is a portion of an anticline in which the organization clearly exhibits oriented entities. The results are obtained by a procedure of interest for geological reasoning: the extraction of the entities involved in the observed structure and the exploration of the main direction of the set of objects marking the structure. Elementary entities are extracted by recognizing their physical and physiognomic characteristics, such as reflectance, shadow effect, size, shape, or orientation. The resulting image must then frequently be stripped of many artifacts. Another sequence was developed to minimize the noise due to the direct identification of physical measures contained in the image. Data from different spectral bands are first filtered by a grayscale morphology operator to remove high-frequency spatial components. The image obtained for subsequent processing is therefore more compact and closer to the geologist's needs. The search for a significant overall direction relies on interception measures sampling a rotation from 0 to 180 degrees. The results show a clear geological significance in the organization of the extracted objects.
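The directional search described in this abstract (interception measures sampled over a 0 to 180 degree rotation) can be illustrated with morphological openings by rotated line segments. The sketch below is a numpy-only toy, not the authors' procedure; the function names, the structuring-element length, and the 15-degree sampling step are illustrative choices:

```python
import numpy as np

def _shift(a, dy, dx):
    """Shift a boolean image by (dy, dx), filling exposed borders with False."""
    h, w = a.shape
    out = np.zeros_like(a)
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        a[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def line_offsets(theta_deg, length):
    """Integer pixel offsets (dy, dx) of a centred line segment at angle theta."""
    t = np.deg2rad(theta_deg)
    ks = np.arange(length) - length // 2
    return {(int(round(k * np.sin(t))), int(round(k * np.cos(t)))) for k in ks}

def opening_residue(img, theta_deg, length=7):
    """Pixels of a binary image surviving an opening by a line at angle theta."""
    offs = line_offsets(theta_deg, length)
    eroded = np.ones_like(img)
    for dy, dx in offs:                       # erosion: AND of shifted copies
        eroded &= _shift(img, -dy, -dx)
    opened = np.zeros_like(img)
    for dy, dx in offs:                       # dilation of the eroded set
        opened |= _shift(eroded, dy, dx)
    return opened

def dominant_direction(img, step=15, length=7):
    """Angle (degrees in [0, 180)) whose line opening preserves the most pixels."""
    angles = np.arange(0, 180, step)
    scores = [opening_residue(img, a, length).sum() for a in angles]
    return angles[int(np.argmax(scores))]
```

Opening a binary map with a line oriented along the objects' dominant direction preserves them, while other orientations erase them, so the angle with the largest surviving area indicates the overall direction of the extracted entities.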
Abstract: Behavior recognition of Hu sheep contributes to their intensive and intelligent farming. Due to the generally high density of Hu sheep farming, severe occlusion occurs among different behaviors and even among sheep performing the same behavior, leading to missed and false detections in existing behavior recognition methods. A high-low frequency aggregated attention, negative-sample comprehensive score loss, and comprehensive score soft non-maximum suppression YOLO (HLNC-YOLO) was proposed for identifying the behavior of Hu sheep, addressing the missed and erroneous detections caused by occlusion between Hu sheep in intensive farming. First, images of four typical behaviors (standing, lying, eating, and drinking) were collected from the sheep farm to construct the Hu sheep behavior dataset (HSBD). Next, to address the occlusion issues during the training phase, the C2F-HLAtt module, which combines high-low frequency aggregation attention, was integrated into the YOLO v8 backbone to perceive occluded objects, and an auxiliary reversible branch was introduced to retain more effective features. A comprehensive score regression loss (CSLoss) was used to reduce the scores of suboptimal boxes and enhance the comprehensive scores of occluded object boxes. Finally, the soft comprehensive score non-maximum suppression (Soft-CS-NMS) algorithm filtered prediction boxes during inference. Tested on the HSBD, HLNC-YOLO achieved a mean average precision (mAP@50) of 87.8%, with a memory footprint of 17.4 MB. This represented an improvement of 7.1, 2.2, 4.6, and 11 percentage points over YOLO v8, YOLO v9, YOLO v10, and Faster R-CNN, respectively. The research indicates that HLNC-YOLO accurately identifies the behavior of Hu sheep in intensive farming and possesses generalization capability, providing technical support for smart farming.
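Soft-CS-NMS builds on the Soft-NMS idea of decaying, rather than discarding, the scores of boxes that overlap the currently best box, so occluded objects are less likely to be suppressed outright. The following is a sketch of generic Gaussian Soft-NMS, not the paper's comprehensive-score variant:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of boxes overlapping the current best,
    returning kept box indices in order of their (decayed) scores."""
    scores = scores.astype(float).copy()
    keep, idxs = [], list(range(len(boxes)))
    while idxs:
        best = max(idxs, key=lambda i: scores[i])
        if scores[best] < score_thresh:
            break
        keep.append(best)
        idxs.remove(best)
        if idxs:
            rest = np.array(idxs)
            overlaps = iou(boxes[best], boxes[rest])
            scores[rest] *= np.exp(-(overlaps ** 2) / sigma)   # soft decay
    return keep
```

With a hard NMS the second of two heavily overlapping boxes would be removed; here its score is only reduced, so a genuinely occluded object can still be kept.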
Funding: Supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R765), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Human Activity Recognition (HAR) is a novel area of computer vision. It has a great impact on healthcare, smart environments, and surveillance, as it can automatically detect human behavior. It plays a vital role in many applications, such as smart homes, healthcare, human-computer interaction, sports analysis, and especially intelligent surveillance. In this paper, we propose a robust and efficient HAR system by leveraging deep learning paradigms, including pre-trained models, CNN architectures, and their average-weighted fusion. However, due to the diversity of human actions and various environmental influences, as well as a lack of data and resources, achieving high recognition accuracy remains elusive. In this work, a weighted average ensemble technique is employed to fuse three deep learning models: EfficientNet, ResNet50, and a custom CNN. The results of this study indicate that a weighted average ensemble strategy may be a promising approach for developing more effective HAR models for the detection and classification of human activities. Experiments using a benchmark dataset proved that the proposed weighted ensemble approach outperformed existing approaches in terms of accuracy and other key performance measures. The combined average-weighted ensemble of pre-trained and CNN models obtained an accuracy of 98%, compared to 97%, 96%, and 95% for the customized CNN, EfficientNet, and ResNet50 models, respectively.
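A weighted average ensemble of the kind described reduces to a convex combination of each model's class probabilities. A minimal numpy sketch, in which the weights and the toy softmax outputs are hypothetical:

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Fuse per-model class-probability arrays by a weighted average."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalise so weights sum to 1
    stacked = np.stack(prob_list)             # (n_models, n_samples, n_classes)
    return np.tensordot(w, stacked, axes=1)   # (n_samples, n_classes)

# Toy softmax outputs from three hypothetical models for two samples.
p1 = np.array([[0.7, 0.3], [0.2, 0.8]])
p2 = np.array([[0.6, 0.4], [0.4, 0.6]])
p3 = np.array([[0.9, 0.1], [0.3, 0.7]])

fused = weighted_ensemble([p1, p2, p3], weights=[0.5, 0.2, 0.3])
pred = fused.argmax(axis=1)   # final class per sample
```

In practice the weights would be tuned on a validation set (e.g., proportional to each model's standalone accuracy).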
Abstract: The initial noise present in depth images obtained with RGB-D sensors is a combination of hardware limitations and environmental factors; the limited capabilities of the sensors also produce poor computer vision results. Common image denoising techniques, being based on spatial and frequency filtering, tend to remove significant image details along with the noise. The framework presented in this paper is a novel denoising model that makes use of Boruta-driven feature selection with a Long Short-Term Memory Autoencoder (LSTMAE). The Boruta algorithm identifies the most useful depth features, which are used to maximize spatial structure integrity and reduce redundancy. An LSTMAE then processes these selected features and models depth pixel sequences to generate robust, noise-resistant representations. The system uses the encoder to compress the input data into a latent space before it is decoded to retrieve the clean image. Experiments on a benchmark dataset show that the suggested technique attains a PSNR of 45 dB and an SSIM of 0.90, which is 10 dB higher than the performance of conventional convolutional autoencoders and 15 times higher than that of the wavelet-based models. Moreover, the feature selection step decreases the input dimensionality by 40%, resulting in a 37.5% reduction in training time and a real-time inference rate of 200 FPS. The Boruta-LSTMAE framework therefore offers a highly efficient and scalable system for depth image denoising, with high potential to be applied to close-range 3D systems such as robotic manipulation and gesture-based interfaces.
Funding: General Program of the National Natural Science Foundation of China (82474390); Construction Project of Pudong New Area Famous TCM Studios (National Pilot Zone for TCM Development, Shanghai) (PDZY-2025-0716); Shanghai Municipal Science and Technology Program Project, Shanghai Key Laboratory of Health Identification and Assessment (21DZ2271000).
Abstract: Objective To develop a depression recognition model by integrating the spirit-expression diagnostic framework of traditional Chinese medicine (TCM) with machine learning algorithms. The proposed model seeks to establish a TCM-informed tool for early depression screening, thereby bridging traditional diagnostic principles with modern computational approaches. Methods The study included patients with depression who visited the Shanghai Pudong New Area Mental Health Center from October 1, 2022 to October 1, 2023, as well as students and teachers from Shanghai University of Traditional Chinese Medicine during the same period as the healthy control group. Videos of 3–10 s were captured using a Xiaomi Pad 5, and the TCM spirit and expressions were determined by TCM experts (at least 3 out of 5 experts agreed on the category of TCM spirit and expressions). Basic information, facial images, and interview information were collected through a portable TCM intelligent analysis and diagnosis device, and facial diagnosis features were extracted using the OpenCV computer vision library. Statistical methods such as parametric and non-parametric tests were used to analyze the baseline data, TCM spirit and expression features, and facial diagnosis feature parameters of the two groups, to compare the differences in TCM spirit and expression and facial features. Five machine learning algorithms, including extreme gradient boosting (XGBoost), decision tree (DT), Bernoulli naive Bayes (BernoulliNB), support vector machine (SVM), and k-nearest neighbor (KNN) classification, were used to construct a depression recognition model based on the fusion of TCM spirit and expression features. The performance of the model was evaluated using metrics such as accuracy, precision, and the area under the receiver operating characteristic (ROC) curve (AUC). The model results were explained using Shapley Additive exPlanations (SHAP). Results A total of 93 depression patients and 87 healthy individuals were ultimately included in this study. There was no statistically significant difference in the baseline characteristics between the two groups (P > 0.05). The differences in TCM spirit and expression characteristics and facial features between the two groups were as follows. (i) Quantispirit facial analysis revealed that depression patients exhibited significantly reduced facial spirit and luminance compared with healthy controls (P < 0.05), with characteristic features such as sad expressions, facial erythema, and lip color changes ranging from erythematous to cyanotic. (ii) Depressed patients exhibited significantly lower values in facial complexion L, lip L and a values, and gloss index, but higher values in facial complexion a and b, lip b, low gloss index, and matte index (all P < 0.05). (iii) The results of the multiple models show that the XGBoost-based depression recognition model, integrating the TCM "spirit-expression" diagnostic framework, achieved an accuracy of 98.61% and significantly outperformed the four benchmark algorithms DT, BernoulliNB, SVM, and KNN (P < 0.01). (iv) The SHAP visualization results show that in the recognition model constructed by the XGBoost algorithm, the complexion b value, categories of facial spirit, high gloss index, low gloss index, categories of facial expression, and texture features contribute significantly to the model. Conclusion This study demonstrates that integrating TCM spirit-expression diagnostic features with machine learning enables the construction of a high-precision depression detection model, offering a novel paradigm for objective depression diagnosis.
Abstract: Securing restricted zones such as airports, research facilities, and military bases requires robust and reliable access control mechanisms to prevent unauthorized entry and safeguard critical assets. Face recognition has emerged as a key biometric approach for this purpose; however, existing systems are often sensitive to variations in illumination, occlusion, and pose, which degrade their performance in real-world conditions. To address these challenges, this paper proposes a novel hybrid face recognition method that integrates complementary feature descriptors such as Fuzzy-Gabor 2D Fisher Linear Discriminant (FG-2DFLD), Generalized 2D Linear Discriminant Analysis (G2DLDA), and Modular Local Binary Patterns (Modular-LBP) with Dempster-Shafer (DS) evidence theory for decision fusion. The proposed framework extracts global, structural, and local texture features, models them using Gaussian distributions to estimate belief factors, and fuses these belief factors through DS theory to explicitly handle uncertainty and conflict among descriptors. Experimental validation was performed on two widely used benchmark datasets, ORL and Cropped Yale B, achieving recognition rates exceeding 98%, which outperform traditional methods as well as recent deep learning-based approaches. Furthermore, the method demonstrated strong robustness under noisy conditions, maintaining accuracies above 96% with salt-and-pepper and Gaussian noise. These results highlight the effectiveness of the proposed integration strategy in enhancing accuracy, reliability, and resilience compared to single-descriptor and conventional fusion methods. Given its high performance and efficiency, the proposed method shows strong potential for deployment in real-world restricted-zone applications such as smart parking systems, secure facility access, and other high-security domains.
Funding: Supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R896).
Abstract: Deep neural networks have achieved excellent classification results on several computer vision benchmarks. This has led to the popularity of machine learning as a service, where trained algorithms are hosted on the cloud and inference can be obtained on real-world data. In most applications, it is important to compress the vision data due to the enormous bandwidth and memory requirements. Video codecs exploit spatial and temporal correlations to achieve high compression ratios, but they are computationally expensive. This work computes the motion fields between consecutive frames to facilitate the efficient classification of videos. However, contrary to the normal practice of reconstructing the full-resolution frames through motion compensation, this work proposes to infer the class label directly from the block-based computed motion fields. Motion fields are a richer and more complex representation than raw motion vectors, where each motion vector carries magnitude and direction information. This approach has two advantages: the cost of motion compensation and video decoding is avoided, and the dimensions of the input signal are highly reduced. This results in a shallower network for classification. The neural network can be trained using motion vectors in two ways: as complex representations or as magnitude-direction pairs. The proposed work trains a convolutional neural network on the direction and magnitude tensors of the motion fields. Our experimental results show 20× faster convergence during training, reduced overfitting, and accelerated inference on a hand gesture recognition dataset compared to full-resolution and downsampled frames. We validate the proposed methodology on the HGds dataset, achieving a testing accuracy of 99.21%, on the HMDB51 dataset, achieving 82.54% accuracy, and on the UCF101 dataset, achieving 97.13% accuracy, outperforming state-of-the-art methods in computational efficiency.
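Converting a block-based motion field into the direction and magnitude tensors used as network input can be sketched in a few lines of numpy; the array layout and names below are assumptions, not the paper's code:

```python
import numpy as np

def motion_to_mag_dir(flow):
    """Convert an (H, W, 2) motion field of (dx, dy) vectors into
    magnitude and direction tensors suitable for a compact classifier."""
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(dx, dy)          # per-block motion strength
    direction = np.arctan2(dy, dx)        # per-block angle, radians in (-pi, pi]
    return magnitude, direction

# Toy 2x2 block-based motion field (hypothetical values).
flow = np.array([[[3.0, 4.0], [0.0, 1.0]],
                 [[-1.0, 0.0], [0.0, 0.0]]])
mag, ang = motion_to_mag_dir(flow)
```

Stacking `mag` and `ang` as a two-channel tensor gives a far smaller input than decoded RGB frames, which is what allows the shallower classification network the abstract describes.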
Funding: Funded by the Deanship of Scientific Research and Libraries, Princess Nourah bint Abdulrahman University, through the Program of Research Project Funding After Publication, grant No. (RPFAP-23-1445).
Abstract: Scene recognition is a critical component of computer vision, powering applications from autonomous vehicles to surveillance systems. However, its development is often constrained by a heavy reliance on large, expensively annotated datasets. This research presents a novel, efficient approach that leverages multi-model transfer learning from pre-trained deep neural networks, specifically DenseNet201 and Visual Geometry Group (VGG), to overcome this limitation. Our method significantly reduces dependency on vast labeled data while achieving high accuracy. Evaluated on the Aerial Image Dataset (AID), the model attained a validation accuracy of 93.6% with a loss of 0.35, demonstrating robust performance with minimal training data. These results underscore the viability of our approach for real-time, data-efficient scene recognition, offering a practical and cost-effective advancement for the field.
Funding: Supported, in part, by the National Natural Science Foundation of China under Grants 62272236, 62376128, and 62306139, and the Natural Science Foundation of Jiangsu Province under Grants BK20201136 and BK20191401.
Abstract: Discriminative region localization and efficient feature encoding are crucial for fine-grained object recognition. However, existing data augmentation methods struggle to accurately locate discriminative regions under complex backgrounds, small target objects, and limited training data, leading to poor recognition. Fine-grained images exhibit "small inter-class differences," and while second-order feature encoding enhances discrimination, it often requires dual Convolutional Neural Networks (CNNs), increasing training time and complexity. This study proposes a model integrating discriminative region localization and efficient second-order feature encoding. By ranking feature map channels via a fully connected layer, it selects high-importance channels to generate an enhanced map, accurately locating discriminative regions. Cropping and erasing augmentations further refine recognition. To improve efficiency, a novel second-order feature encoding module generates an attention map from the fourth convolutional group of ResNet-50 and multiplies it with features from the fifth group, producing second-order features while reducing dimensionality and training time. Experiments on the Caltech-UCSD Birds-200-2011 (CUB-200-2011), Stanford Car, and Fine-Grained Visual Classification of Aircraft (FGVC Aircraft) datasets show state-of-the-art accuracies of 88.9%, 94.7%, and 93.3%, respectively.
Funding: Funded by the Natural Science Foundation of Chongqing Municipality, grant number CSTB2022NSCQ-MSX0503.
Abstract: Gait recognition is a key biometric for long-distance identification, yet its performance is severely degraded by real-world challenges such as varying clothing, carrying conditions, and changing viewpoints. While combining silhouette and skeleton data is a promising direction, effectively fusing these heterogeneous modalities and adaptively weighting their contributions in response to diverse conditions remains a central problem. This paper introduces GaitMAFF, a novel Multi-modal Adaptive Feature Fusion Network, to address this challenge. Our approach first transforms discrete skeleton joints into a dense SkeletonMap representation to align with silhouettes, then employs an attention-based module to dynamically learn the fusion weights between the two modalities. These fused features are processed by a powerful spatio-temporal backbone with Weighted Global-Local Feature Fusion Modules (WFFM) to learn a discriminative representation. Extensive experiments on the challenging CCPG and Gait3D datasets show that GaitMAFF achieves state-of-the-art performance, with an average Rank-1 accuracy of 84.6% on CCPG and 58.7% on Gait3D. These results demonstrate that our adaptive fusion strategy effectively integrates complementary multimodal information, significantly enhancing gait recognition robustness and accuracy in complex scenes and providing a practical solution for real-world applications.
Funding: Supported by the Fundamental Research Funds for the Central Universities (Grant Nos. 3282025045 and 3282024008) and the Science and Technology Project of the State Archives Administration of China (Grant No. 2025-Z-009).
Abstract: Person recognition in photo collections is a critical yet challenging task in computer vision. Previous studies have used social relationships within photo collections to address this issue. However, these methods often fail when recognizing a single person in a photo within a collection, as they cannot rely on social connections for recognition. In this work, we discard social relationships and instead measure the relationships between photos to solve this problem. We designed a new model that includes a multi-parameter attention network for adaptively fusing visual features and a unified formula for measuring photo intimacy. This model effectively recognizes individuals in a single photo within the collection. Due to outdated annotations and missing photos in the existing PIPA (Person in Photo Album) dataset, we manually re-annotated it and added approximately ten thousand photos of Asian individuals to address the underrepresentation issue. Our results on the re-annotated PIPA dataset are superior to previous studies in most cases, and experiments on the supplemented dataset further demonstrate the effectiveness of our method. We have made the PIPA dataset publicly available on Zenodo, with the DOI: 10.5281/zenodo.12508096 (accessed on 15 October 2025).
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 42077242 and 42171407) and the Graduate Innovation Fund of Jilin University.
Abstract: Accurate and rapid recognition of weathering degree (WD) and groundwater condition (GC) is essential for evaluating rock mass quality and conducting stability analyses in underground engineering. Conventional WD and GC recognition methods often rely on subjective evaluation by field experts, supplemented by field sampling and laboratory testing. These methods are frequently complex and time-consuming, making it challenging to meet the rapidly evolving demands of underground engineering. Therefore, this study proposes a rock non-geometric parameter classification network (RNPC-net) to rapidly achieve the recognition and mapping of WD and GC on tunnel faces. The hybrid feature extraction module (HFEM) in RNPC-net can fully extract, fuse, and utilize multi-scale image features, enhancing the network's classification performance. Moreover, the designed adaptive weighting auxiliary classifier (AC) helps the network learn features more efficiently. Experimental results show that RNPC-net achieved classification accuracies of 0.8756 and 0.8710 for WD and GC, respectively, representing an improvement of approximately 2%-10% over other methods. Both quantitative and qualitative experiments confirm the effectiveness and superiority of RNPC-net. Furthermore, for WD and GC mapping, RNPC-net outperformed other methods by achieving the highest mean intersection over union (mIoU) across most tunnel faces. The mapping results closely align with measurements provided by field experts. Applying the WD and GC mapping results to the rock mass rating (RMR) system achieved a transition from conventional qualitative to quantitative evaluation. This advancement enables more accurate and reliable rock mass quality evaluations, particularly under critical RMR conditions.
Abstract: This study presents a hybrid CNN-Transformer model for real-time recognition of affective tactile biosignals. The proposed framework combines convolutional neural networks (CNNs), which extract spatial and local temporal features, with a Transformer encoder that captures long-range dependencies in time-series data through multi-head attention. Model performance was evaluated on two widely used tactile biosignal datasets, HAART and CoST, which contain diverse affective touch gestures recorded from pressure sensor arrays. The CNN-Transformer model achieved recognition rates of 93.33% on HAART and 80.89% on CoST, outperforming existing methods on both benchmarks. By incorporating temporal windowing, the model enables instantaneous prediction, improving generalization across gestures of varying duration. These results highlight the effectiveness of deep learning for tactile biosignal processing and demonstrate the potential of the CNN-Transformer approach for future applications in wearable sensors, affective computing, and biomedical monitoring.
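The temporal windowing step mentioned above amounts to slicing the biosignal into fixed-length overlapping segments that the model classifies independently. A minimal sketch; the window and hop sizes are arbitrary illustrative choices:

```python
import numpy as np

def sliding_windows(x, win, hop):
    """Split a (T, C) multichannel biosignal into overlapping temporal windows."""
    starts = range(0, x.shape[0] - win + 1, hop)
    return np.stack([x[s:s + win] for s in starts])   # (n_windows, win, C)

# Toy signal: 10 timesteps, 2 pressure channels.
sig = np.arange(20).reshape(10, 2).astype(float)
w = sliding_windows(sig, win=4, hop=2)
```

Because every window has the same length, gestures of very different durations produce the same per-window input shape, which is what lets a single model generalize across them.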
Funding: Supported by Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia, through the Researchers Supporting Project PNURSP2025R333.
Abstract: Human activity recognition (HAR) is a method to predict human activities from sensor signals using machine learning (ML) techniques. HAR systems have several applications in various domains, including medicine, surveillance, behavioral monitoring, and posture analysis. Extraction of suitable information from sensor data is an important part of the HAR process for recognizing activities accurately. Several research studies on HAR have utilized Mel frequency cepstral coefficients (MFCCs) because of their effectiveness in capturing the periodic pattern of sensor signals. However, existing MFCC-based approaches often fail to capture sufficient temporal variability, which limits their ability to robustly distinguish between complex or imbalanced activity classes. To address this gap, this study proposes a feature fusion strategy that merges time-based and MFCC features (MFCCT) to enhance activity representation. The merged features were fed to a convolutional neural network (CNN) integrated with long short-term memory (LSTM), DeepConvLSTM, to construct the HAR model. The MFCCT features with DeepConvLSTM achieved better performance than MFCCs and time-based features on PAMAP2, UCI-HAR, and WISDM, obtaining accuracies of 97%, 98%, and 97%, respectively. In addition, DeepConvLSTM outperformed the deep learning (DL) algorithms that have recently been employed in HAR. These results confirm that the proposed hybrid features are not only practical but also generalizable, making them applicable across diverse HAR datasets for accurate activity classification.
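The MFCCT fusion described here concatenates per-window time-domain statistics with the MFCC vector of the same window. A minimal numpy sketch, in which the feature choices are illustrative and the `mfccs` array stands in for real Mel-frequency cepstral coefficients:

```python
import numpy as np

def time_features(window):
    """Simple per-window time-domain statistics of a 1-D sensor signal."""
    zc = np.mean(np.abs(np.diff(np.signbit(window).astype(int))))  # zero-crossing rate
    return np.array([window.mean(), window.std(), window.min(), window.max(), zc])

def fuse(windows, mfccs):
    """Concatenate time-domain and MFCC features window by window (MFCCT-style)."""
    t = np.stack([time_features(w) for w in windows])   # (n_windows, 5)
    return np.concatenate([t, mfccs], axis=1)           # (n_windows, 5 + n_mfcc)

rng = np.random.default_rng(0)
windows = rng.standard_normal((10, 128))   # 10 windows of 128 samples each
mfccs = rng.standard_normal((10, 13))      # stand-in for 13 MFCCs per window
fused = fuse(windows, mfccs)               # input rows for the classifier
```

Each fused row then serves as one training example for the downstream classifier, so the temporal statistics supply the variability the MFCCs alone miss.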
Abstract: Industrial operators need reliable communication in high-noise, safety-critical environments where speech or touch input is often impractical. Existing gesture systems either miss real-time deadlines on resource-constrained hardware or lose accuracy under occlusion, vibration, and lighting changes. We introduce Industrial EdgeSign, a dual-path framework that combines hardware-aware neural architecture search (NAS) with large multimodal model (LMM) guided semantics to deliver robust, low-latency gesture recognition on edge devices. The searched model uses a truncated ResNet50 front end, a dimensional-reduction network that preserves spatiotemporal structure for tubelet-based attention, and localized Transformer layers tuned for on-device inference. To reduce reliance on gloss annotations and mitigate domain shift, we distill semantics from factory-tuned vision-language models and pre-train with masked language modeling and video-text contrastive objectives, aligning visual features with a shared text space. On ML2HP and SHREC'17, the NAS-derived architecture attains 94.7% accuracy with 86 ms inference latency and about 5.9 W power on Jetson Nano. Under occlusion, lighting shifts, and motion blur, accuracy remains above 82%. For safety-critical commands, the emergency-stop gesture achieves 72 ms 99th-percentile latency with 99.7% fail-safe triggering. Ablation studies confirm the contribution of the spatiotemporal tubelet extractor and text-side pre-training, and we observe gains in translation quality (BLEU-4 22.33). These results show that Industrial EdgeSign provides accurate, resource-aware, and safety-aligned gesture recognition suitable for deployment in smart factory settings.
Funding: Supported by the Institute of Information & Communications Technology Planning & Evaluation grant funded by the Korea government (MSIT) (No. RS-2021-II211341, AI Graduate School Support Program, Chung-Ang University), and in part by the Institute of Information and Communications Technology Planning and Evaluation grant funded by the Korea government (MSIT) (Development of Integrated Development Framework that Supports Automatic Neural Network Generation and Deployment Optimized for Runtime Environment, Grant No. 2021-0-00766).
Abstract: Multimodal emotion recognition (MER) has emerged as a key research area for enabling human-centered artificial intelligence, supported by rapid progress in vision, audio, language, and physiological modeling. Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms, yet the field remains fragmented due to differences in feature alignment, temporal synchronization, modality reliability, and robustness to noise or missing inputs. This survey provides a comprehensive analysis of MER research from 2021 to 2025, consolidating advances in modality-specific representation learning, cross-modal feature construction, and early, late, and hybrid fusion paradigms. We systematically review visual, acoustic, textual, and sensor-based embeddings, highlighting how pre-trained encoders, self-supervised learning, and large language models have reshaped the representational foundations of MER. We further categorize fusion strategies by interaction depth and architectural design, examining how attention mechanisms, cross-modal transformers, adaptive gating, and multimodal large language models redefine the integration of affective signals. Finally, we summarize major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability, generalization, and interpretability. This survey aims to provide a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.
Funding: financially supported by the National Natural Science Foundation of China (Nos. 22377097, 22307036, 22074114), the Natural Science Foundation of Hubei Province of China (Nos. 2020CFB623, 2021CFB556), and the Engineering Research Center of Phosphorus Resources Development and Utilization of Ministry of Education (No. LCX202305).
Abstract: The detection of amino acid enantiomers holds significant importance in the biomedical, chemical, food, and other fields. Traditional chiral recognition methods using fluorescent probes rely primarily on fluorescence intensity changes, which can compromise accuracy and repeatability. In this study, we report a novel fluorescent probe, (R)-Z1, that achieves effective enantioselective recognition of chiral amino acids in water by shifting its emission wavelength (>60 nm). This water-soluble probe exhibits cyan or yellow-green luminescence upon interaction with amino acid enantiomers, enabling reliable chiral detection of 14 natural amino acids. It also allows the determination of enantiomeric excess by monitoring changes in luminescent color. Additionally, a logic operation with two inputs and three outputs was constructed based on these optical properties. Notably, amino acid enantiomers were successfully detected via dual-channel analysis at both the food and cellular levels. This study provides a new dynamic luminescence-based tool for the accurate sensing and detection of amino acid enantiomers.
Funding: funded by the National Key Research and Development Program of China (Grant No. 2023YFC3009400).
Abstract: Discontinuities in rock masses critically impact the stability and safety of underground engineering. Mainstream discontinuity identification methods, which rely on normal vector estimation and clustering algorithms, suffer from accuracy degradation and omission of critical discontinuities when orientation density is unevenly distributed, and they require manual intervention. To overcome these limitations, this paper introduces a novel discontinuity identification method based on geometric feature analysis of the rock mass. By analyzing the spatial distribution variability of the point cloud and integrating an adaptive region growing algorithm, the method accurately detects independent discontinuities under complex geological conditions. Given that rock mass orientations typically follow a Fisher distribution, an adaptive hierarchical clustering algorithm based on statistical analysis is employed to automatically determine the optimal number of structural sets, eliminating the preset cluster counts and thresholds inherent in traditional methods. The proposed approach effectively handles diverse rock mass shapes and sizes, leveraging both local and global geometric features to minimize noise interference. Experimental validation on three real-world rock mass models, alongside comparisons with three conventional directional clustering algorithms, demonstrates superior accuracy and robustness in identifying optimal discontinuity sets. The proposed method offers a reliable and efficient tool for discontinuity detection and grouping in underground engineering, significantly enhancing design and construction outcomes.
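The Fisher-distribution assumption mentioned in the abstract above can be made concrete with a small sketch: given a set of unit normal vectors estimated from a point cloud, the mean orientation and the classical concentration estimate kappa ≈ (N − 1)/(N − R) can be computed directly. This is a generic statistics sketch, not the paper's clustering algorithm, and the synthetic data below are illustrative:

```python
import numpy as np

def fisher_stats(normals):
    """Mean orientation and Fisher concentration estimate for a set
    of unit normal vectors (one per row). Uses the classical
    kappa ~ (N - 1) / (N - R) approximation, valid for tight clusters."""
    n = np.asarray(normals, dtype=float)
    n /= np.linalg.norm(n, axis=1, keepdims=True)  # ensure unit length
    resultant = n.sum(axis=0)
    R = np.linalg.norm(resultant)                  # resultant length <= N
    N = len(n)
    mean_dir = resultant / R                       # unit mean direction
    kappa = (N - 1) / (N - R)                      # concentration estimate
    return mean_dir, kappa

# Synthetic structural set: normals tightly scattered about the z-axis
rng = np.random.default_rng(1)
pts = np.column_stack([rng.normal(0, 0.05, 200),
                       rng.normal(0, 0.05, 200),
                       np.ones(200)])
mean_dir, kappa = fisher_stats(pts)
print(mean_dir.round(2), kappa)
```

A clustering-based method would compute such statistics per candidate set; a high kappa indicates a tightly concentrated structural set, while a low kappa flags dispersed orientations.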
Funding: supported in part by the National Natural Science Foundation of China (Grant No. 61773330).
Abstract: Audio-visual speech recognition (AVSR), which integrates audio and visual modalities to improve recognition performance and robustness in noisy or adverse acoustic conditions, has attracted significant research interest. However, Conformer-based architectures remain computationally expensive because the spatial and temporal complexity of their softmax-based attention mechanisms grows quadratically with sequence length. In addition, Conformer-based architectures may not provide sufficient flexibility for modeling local dependencies at different granularities. To mitigate these limitations, this study introduces a novel AVSR framework based on a ReLU-based Sparse and Grouped Conformer (RSG-Conformer) architecture. Specifically, we propose a Global-enhanced Sparse Attention (GSA) module incorporating an efficient context restoration block to recover lost contextual cues. Concurrently, a Grouped-scale Convolution (GSC) module replaces the standard Conformer convolution module, providing adaptive local modeling across varying temporal resolutions. Furthermore, we integrate a Refined Intermediate Contextual CTC (RIC-CTC) supervision strategy, which applies progressively increasing loss weights combined with convolution-based context aggregation, thereby further relaxing the conditional-independence constraint inherent in standard CTC frameworks. Evaluations on the LRS2 and LRS3 benchmarks validate the efficacy of our approach, with word error rates (WERs) reduced to 1.8% and 1.5%, respectively. These results demonstrate state-of-the-art performance in AVSR tasks.
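The "progressively increasing loss weights" idea behind the RIC-CTC strategy described above can be sketched in a few lines: intermediate CTC losses from shallower layers receive smaller weights than those from deeper layers before being blended with the final loss. The linear weighting scheme, the alpha value, and the example loss values below are illustrative assumptions, not the paper's actual formulation:

```python
def combine_intermediate_losses(inter_losses, final_loss, alpha=0.3):
    """Blend intermediate CTC losses (ordered shallow -> deep) with the
    final CTC loss. Intermediate layers get linearly increasing weights,
    so deeper layers contribute more auxiliary supervision."""
    k = len(inter_losses)
    total = sum(range(1, k + 1))
    weights = [(i + 1) / total for i in range(k)]   # e.g. [1/6, 2/6, 3/6]
    inter_term = sum(w * l for w, l in zip(weights, inter_losses))
    return (1 - alpha) * final_loss + alpha * inter_term

# Example: CTC losses taken after three intermediate layers, plus the
# loss at the final output layer (values are made up for illustration)
loss = combine_intermediate_losses([2.1, 1.7, 1.4], final_loss=1.2)
print(round(loss, 4))
```

In a real training loop each value would be a differentiable CTC loss tensor rather than a float, but the weighting arithmetic is identical.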
Funding: funded by the Guangzhou Development Zone Science and Technology Project (2023GH02), the University of Macao (MYRG2022-00271-FST), and research grants from the Science and Technology Development Fund of Macao (0032/2022/A and 0019/2025/RIB1).
Abstract: Accurately recognizing driver distraction is critical for preventing traffic accidents, yet current detection models face two persistent challenges. First, distractions are often fine-grained, involving subtle cues such as brief eye closures or partial yawns, which are easily missed by conventional detectors. Second, in real-world scenarios, drivers frequently exhibit overlapping behaviors, such as simultaneously holding a cup, closing their eyes, and yawning, leading to multiple detection boxes and degraded model performance. Existing approaches fail to robustly address these complexities, resulting in limited reliability in safety-critical applications. To overcome these pain points, we propose YOLO-Drive, a novel framework that enhances YOLO-based driver monitoring with EfficientViM and Polarized Spectral-Spatial Attention (PSSA) modules. EfficientViM provides lightweight yet powerful global-local feature extraction, enabling accurate recognition of subtle driver states. PSSA further amplifies discriminative features across the spatial and spectral domains, ensuring robust separation of concurrent distraction cues. By explicitly modeling fine-grained and overlapping behaviors, our approach delivers significant improvements in both precision and robustness. Extensive experiments on benchmark driver distraction datasets demonstrate that YOLO-Drive consistently outperforms state-of-the-art models, achieving higher detection accuracy while maintaining real-time efficiency. These results validate YOLO-Drive as a practical and reliable solution for advanced driver monitoring systems, addressing long-standing challenges of subtle-cue recognition and multi-cue distraction detection.