Discriminative region localization and efficient feature encoding are crucial for fine-grained object recognition. However, existing data augmentation methods struggle to accurately locate discriminative regions under complex backgrounds, small target objects, and limited training data, leading to poor recognition performance. Fine-grained images exhibit “small inter-class differences,” and while second-order feature encoding enhances discrimination, it often requires dual Convolutional Neural Networks (CNNs), increasing training time and complexity. This study proposes a model integrating discriminative region localization and efficient second-order feature encoding. By ranking feature map channels via a fully connected layer, it selects high-importance channels to generate an enhanced map, accurately locating discriminative regions. Cropping and erasing augmentations further refine recognition. To improve efficiency, a novel second-order feature encoding module generates an attention map from the fourth convolutional group of the 50-layer Residual Network (ResNet-50) and multiplies it with features from the fifth group, producing second-order features while reducing dimensionality and training time. Experiments on the Caltech-UCSD Birds-200-2011 (CUB-200-2011), Stanford Cars, and Fine-Grained Visual Classification of Aircraft (FGVC Aircraft) datasets show state-of-the-art accuracies of 88.9%, 94.7%, and 93.3%, respectively.
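To make the encoding step concrete, the following is a minimal PyTorch sketch of the described idea: an attention map derived from ResNet-50's fourth convolutional stage modulates fifth-stage features before pooling. The 1x1 attention head, the pooling choices, and the layer mapping are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SecondOrderEncoder(nn.Module):
    """Sketch: conv4-derived attention multiplied with conv5 features."""
    def __init__(self, num_classes=200):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # In torchvision, layer3 is the fourth conv group (1024 channels)
        # and layer4 is the fifth (2048 channels).
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1,
                                  backbone.layer2, backbone.layer3)
        self.stage5 = backbone.layer4
        self.attn = nn.Sequential(nn.Conv2d(1024, 1, kernel_size=1), nn.Sigmoid())
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        f4 = self.stem(x)                                # fourth-group features
        f5 = self.stage5(f4)                             # fifth-group features
        a = self.attn(f4)                                # single-channel attention map
        a = nn.functional.adaptive_avg_pool2d(a, f5.shape[-2:])  # match spatial size
        f = (f5 * a).mean(dim=(2, 3))                    # second-order interaction + pooling
        return self.fc(f)
```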
With the emergence and development of social networks, people can stay in touch with friends, family, and colleagues more quickly and conveniently, regardless of their location. This ubiquitous digital internet environment has also led to large-scale disclosure of personal privacy. Due to the complexity and subtlety of sensitive information, traditional sensitive information identification technologies cannot thoroughly address the characteristics of each piece of data, thus weakening the deep connections between text and images. In this context, this paper adopts the CLIP model as a modality discriminator. Through contrastive learning between sensitive image descriptions and images, the similarity between the images and the sensitive descriptions is obtained to determine whether the images contain sensitive information. This provides the basis for identifying sensitive information using different modalities. Specifically, if the original data does not contain sensitive information, only single-modality text-sensitive information identification is performed; if the original data contains sensitive information, multimodal sensitive information identification is conducted. This approach allows for differentiated processing of each piece of data, thereby achieving more accurate sensitive information identification. The proposed modality discriminator addresses the limitations of existing sensitive information identification technologies, making the identification of sensitive information from the original data more appropriate and precise.
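As a sketch of how such a modality discriminator can be realized, the snippet below scores an image against sensitive-content descriptions with an off-the-shelf CLIP checkpoint. The checkpoint name, the prompt list, and the routing rule are illustrative assumptions rather than the paper's configuration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# First prompt is neutral; the rest are hypothetical sensitive descriptions.
prompts = ["an ordinary photo with no personal information",
           "a photo showing an identity document",
           "a photo showing a home address on an envelope"]

def image_is_sensitive(path: str) -> bool:
    inputs = processor(text=prompts, images=Image.open(path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape (1, len(prompts))
    best = int(logits.softmax(dim=-1).argmax())
    # Route the item to multimodal identification only if a sensitive prompt wins.
    return best != 0
```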
Accurately recognizing driver distraction is critical for preventing traffic accidents, yet current detection models face two persistent challenges. First, distractions are often fine-grained, involving subtle cues such as brief eye closures or partial yawns, which are easily missed by conventional detectors. Second, in real-world scenarios, drivers frequently exhibit overlapping behaviors, such as simultaneously holding a cup, closing their eyes, and yawning, leading to multiple detection boxes and degraded model performance. Existing approaches fail to robustly address these complexities, resulting in limited reliability in safety-critical applications. To overcome these pain points, we propose YOLO-Drive, a novel framework that enhances YOLO-based driver monitoring with EfficientViM and Polarized Spectral-Spatial Attention (PSSA) modules. EfficientViM provides lightweight yet powerful global-local feature extraction, enabling accurate recognition of subtle driver states. PSSA further amplifies discriminative features across spatial and spectral domains, ensuring robust separation of concurrent distraction cues. By explicitly modeling fine-grained and overlapping behaviors, our approach delivers significant improvements in both precision and robustness. Extensive experiments on benchmark driver distraction datasets demonstrate that YOLO-Drive consistently outperforms state-of-the-art models, achieving higher detection accuracy while maintaining real-time efficiency. These results validate YOLO-Drive as a practical and reliable solution for advanced driver monitoring systems, addressing long-standing challenges of subtle cue recognition and multi-cue distraction detection.
Fine-grained Image Recognition (FGIR) is dedicated to distinguishing similar sub-categories that belong to the same super-category, such as bird species and car types. In order to highlight visual differences, existing FGIR works often follow two steps: discriminative sub-region localization and local feature representation. However, these works pay less attention to global context information. They neglect the fact that subtle visual differences in challenging scenarios can be highlighted by exploiting the spatial relationships among different sub-regions from a global viewpoint. Therefore, in this paper, we consider both global and local information for FGIR, and propose a collaborative teacher-student strategy to reinforce and unify the two types of information. Our framework is implemented mainly with convolutional neural networks and is referred to as the Teacher-Student Based Attention Convolutional Neural Network (T-S-ACNN). For fine-grained local information, we choose the classic Multi-Attention Network (MA-Net) as our baseline, and propose a type of boundary constraint to further reduce background noise in the local attention maps. In this way, the discriminative sub-regions tend to appear in the area occupied by fine-grained objects, leading to more accurate sub-region localization. For fine-grained global information, we design a graph convolution based Global Attention Network (GA-Net), which combines the local attention maps extracted from MA-Net with non-local techniques to explore spatial relationships among sub-regions. Finally, we develop a collaborative teacher-student strategy to adaptively determine the attended roles and optimization modes, so as to enhance the cooperative reinforcement of MA-Net and GA-Net. Extensive experiments on the CUB-200-2011, Stanford Cars and FGVC Aircraft datasets illustrate the promising performance of our framework.
Fine-grained recognition of ships based on remote sensing images is crucial to safeguarding maritime rights and interests and maintaining national security. With the emergence of massive high-resolution multi-modality images, the use of multi-modality images for fine-grained recognition has become a promising technology. Fine-grained recognition of multi-modality images imposes higher requirements on dataset samples; the key problem is how to extract and fuse the complementary features of multi-modality images to obtain more discriminative fusion features. The attention mechanism helps a model pinpoint the key information in an image, resulting in a significant improvement in performance. In this paper, a dataset for fine-grained recognition of ships based on visible and near-infrared multi-modality remote sensing images is first proposed, named the Dataset for Multimodal Fine-grained Recognition of Ships (DMFGRS). It includes 1,635 pairs of visible and near-infrared remote sensing images divided into 20 categories, collated from digital orthophoto models provided by commercial remote sensing satellites. DMFGRS provides two types of annotation format files, as well as segmentation mask images corresponding to the ship targets. Then, a Multimodal Information Cross-Enhancement Network (MICE-Net), which fuses features of visible and near-infrared remote sensing images, is proposed. In the network, a dual-branch feature extraction and fusion module is designed to obtain more expressive features. The Feature Cross Enhancement Module (FCEM) achieves fused enhancement of the two modal features by making channel attention and spatial attention work cross-functionally on the feature maps. A benchmark is established by evaluating state-of-the-art object recognition algorithms on DMFGRS. In experiments on DMFGRS, MICE-Net reached a precision, recall, mAP@0.5 and mAP@0.5:0.95 of 87%, 77.1%, 83.8% and 63.9%, respectively. Extensive experiments demonstrate that the proposed MICE-Net delivers superior performance on DMFGRS. Built on the lightweight YOLO network, the model generalizes well and thus has good potential for application in real-life scenarios.
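A rough PyTorch sketch of the cross-applied attention idea behind FCEM follows: channel attention computed from one modality re-weights the other, while spatial attention flows back the opposite way. The reduction ratio, kernel size, and additive fusion are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class CrossEnhance(nn.Module):
    """Sketch of cross-modal channel/spatial attention (visible vs. near-infrared)."""
    def __init__(self, channels):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, vis, nir):
        # Assumes both modality feature maps are spatially aligned.
        ca = self.channel_mlp(nir.mean(dim=(2, 3)))       # NIR -> channel weights
        vis = vis * ca[:, :, None, None]                  # modulate visible branch
        sa = self.spatial_conv(torch.cat([vis.mean(1, keepdim=True),
                                          vis.amax(1, keepdim=True)], dim=1))
        nir = nir * sa                                    # visible -> spatial weights
        return vis + nir                                  # fused features
```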
Behavior recognition of Hu sheep contributes to their intensive and intelligent farming. Due to the generally high density of Hu sheep farming, severe occlusion occurs among different behaviors and even among sheep performing the same behavior, leading to missed and false detections in existing behavior recognition methods. A YOLO variant with high-low frequency aggregated attention, a negative-sample comprehensive score loss, and comprehensive-score soft non-maximum suppression (HLNC-YOLO) was proposed for identifying the behavior of Hu sheep, addressing the missed and erroneous detections caused by occlusion between Hu sheep in intensive farming. First, images of four typical behaviors (standing, lying, eating, and drinking) were collected from the sheep farm to construct the Hu sheep behavior dataset (HSBD). Next, to address the occlusion issues, the C2F-HLAtt module, which combines high-low frequency aggregation attention, was integrated into the YOLO v8 backbone during the training phase to perceive occluded objects, and an auxiliary reversible branch was introduced to retain more effective features. A comprehensive score regression loss (CSLoss) was used to reduce the scores of suboptimal boxes and enhance the comprehensive scores of occluded object boxes. Finally, the soft comprehensive score non-maximum suppression (Soft-CS-NMS) algorithm filtered prediction boxes during inference. Tested on the HSBD, HLNC-YOLO achieved a mean average precision (mAP@50) of 87.8%, with a memory footprint of 17.4 MB. This represents an improvement of 7.1, 2.2, 4.6, and 11 percentage points over YOLO v8, YOLO v9, YOLO v10, and Faster R-CNN, respectively. The research indicates that HLNC-YOLO accurately identifies the behavior of Hu sheep in intensive farming and generalizes well, providing technical support for smart farming.
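The general mechanism of soft score-decay suppression that Soft-CS-NMS builds on can be sketched as below: rather than deleting boxes that overlap the current winner, their (comprehensive) scores are decayed by overlap, so heavily occluded sheep keep a detectable box. The Gaussian decay and its sigma follow the standard Soft-NMS recipe and are assumptions here; the paper's comprehensive-score formula is not reproduced.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = ((box[2] - box[0]) * (box[3] - box[1])
             + (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) - inter)
    return inter / np.clip(union, 1e-9, None)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft-NMS with Gaussian score decay instead of hard removal."""
    boxes, scores = boxes.copy(), scores.copy()
    keep = []
    while len(scores) > 0:
        i = int(np.argmax(scores))
        keep.append(boxes[i])
        ious = iou(boxes[i], boxes)
        scores = scores * np.exp(-(ious ** 2) / sigma)   # decay overlapping scores
        mask = scores > score_thresh
        mask[i] = False                                  # drop the kept box itself
        boxes, scores = boxes[mask], scores[mask]
    return np.array(keep)
```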
Through tracing the background and customary usage of the classification of fine-grained sedimentary rocks and related terminology, and comparing current “sedimentary petrology” textbooks and monographs, this paper proposes a classification scheme for fine-grained sedimentary rocks and clarifies related terminology. The comprehensive analysis indicates that the classification of clastic rocks, volcanic clastic rocks, chemical rocks, and biogenic (carbonate) rocks is unified, and the definitions of terms such as lamination, bedding and beds are consistent. However, there is disagreement on the definition of “mud”. European and American scholars commonly use the term “mud” to include silt and clay (particle size less than 0.0625 mm), whereas Chinese scholars equate “mud” with “clay” (particle size less than 0.0039 mm or less than 0.01 mm). Combined with the discussion of terms such as sedimentary structures (bedding, lamination and lamellation), shale, mudstone, mudrocks/argillaceous rocks and mud shale, it is recommended to use “fine-grained sedimentary rocks” as the general term for all sedimentary rocks composed of fine-grained materials with particle size less than 0.0625 mm, including claystone/mudrocks and siltstone. Claystone/mudrocks are further classified into argillaceous (or clayey) mudstone/shale, calcareous mudstone/shale, siliceous mudstone/shale, silty mudstone/shale and silt-containing mudstone/shale. Argillaceous (or clayey) mudstone/shale requires a content of clay minerals or clay-sized particles exceeding 50%; the other mudstones/shales require a content of particles with size less than 0.0625 mm exceeding 50%. The commonly used term “shale” should not include siltstone. It is necessary to establish a reasonable, standardized, and applicable classification scheme for fine-grained sedimentary rocks in the future. Integrated shale microfacies research at the thin-section scale should be carried out and combined with well-logging data interpretation and seismic attribute analysis, so that a geological model of lithology/lithofacies can be iteratively upgraded to accurately determine sweet spots, locate target layers, and evaluate favorable areas.
A fine-grained metastable dual-phase Fe_(40)Mn_(20)Co_(20)Cr_(15)Si_(5) high entropy alloy (CS-HEA) with excellent strength and ductility was successfully prepared by friction stir processing (FSP). The microstructural and mechanical properties of the fine-grained CS-HEA were characterized. The results showed that as-cast shrinkage cavities and elemental segregation were eliminated. The average grain size was refined from 121.1 to 5.4 μm. The face-centered cubic phase fraction increased from 23% to 82%. During tensile deformation, dislocation slip dominated at strains ranging from 5% to 17%, followed by transformation-induced plasticity (TRIP) from 17% to 26%, and twinning-induced plasticity (TWIP) from 26% to 37%. The yield strength, ultimate tensile strength, and elongation of the fine-grained CS-HEA were 503 MPa, 1120 MPa, and 37%, respectively. The strength-ductility synergy of the fine-grained CS-HEA was attributed to the combined effects of TRIP, TWIP, dislocation strengthening, and fine-grained strengthening.
Audio-visual speech recognition (AVSR), which integrates audio and visual modalities to improve recognition performance and robustness in noisy or adverse acoustic conditions, has attracted significant research interest. However, Conformer-based architectures remain computationally expensive due to the quadratic increase in the spatial and temporal complexity of their softmax-based attention mechanisms with sequence length. In addition, Conformer-based architectures may not provide sufficient flexibility for modeling local dependencies at different granularities. To mitigate these limitations, this study introduces a novel AVSR framework based on a ReLU-based Sparse and Grouped Conformer (RSG-Conformer) architecture. Specifically, we propose a Global-enhanced Sparse Attention (GSA) module incorporating an efficient context restoration block to recover lost contextual cues. Concurrently, a Grouped-scale Convolution (GSC) module replaces the standard Conformer convolution module, providing adaptive local modeling across varying temporal resolutions. Furthermore, we integrate a Refined Intermediate Contextual CTC (RIC-CTC) supervision strategy, which applies progressively increasing loss weights combined with convolution-based context aggregation, thereby further relaxing the conditional independence constraint inherent in standard CTC frameworks. Evaluations on the LRS2 and LRS3 benchmarks validate the efficacy of our approach, with word error rates (WERs) reduced to 1.8% and 1.5%, respectively, demonstrating state-of-the-art performance in AVSR tasks.
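The core of ReLU-based sparse attention can be illustrated with the generic form below: replacing the softmax with a ReLU zeroes out negative query-key scores, inducing sparsity, while a simple length-dependent normalizer keeps magnitudes stable. This is an illustrative form under stated assumptions, not the exact GSA formulation.

```python
import torch

def relu_attention(q, k, v):
    """Sparse attention sketch: ReLU instead of softmax over score rows.

    q, k, v: tensors of shape (batch, heads, seq_len, dim).
    """
    n, d = q.shape[-2], q.shape[-1]
    scores = torch.relu(q @ k.transpose(-2, -1)) / (n * d ** 0.5)  # zeros = sparsity
    return scores @ v
```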
To address crop depredation by intelligent species (e.g., macaques) and the habituation induced by traditional deterrence methods, this study proposes an intelligent, closed-loop, adaptive laser deterrence system. A core contribution is an efficient multi-stage Semi-Supervised Learning (SSL) and incremental fine-tuning (IFT) framework, which reduced manual annotation by ~60% and training time by ~68%. This framework was benchmarked against YOLOv8n, v10n, and v11n. Our analysis revealed that YOLOv12n's high Signal-to-Noise Ratio (SNR) pseudo-labels (47.1% retention) made it the only model to gain performance (+0.010 mAP) from SSL, allowing it to overtake competitors. Subsequently, in the IFT stress test, YOLOv12n proved most robust (a minimal −0.019 mAP decline), whereas YOLOv10n suffered catastrophic failure (−0.233 mAP), highlighting its incompatibility with IFT. The final model achieved high performance (mAP@0.5 of 0.947 for macaques, 0.946 for laser spots). In Multi-Object Tracking (MOT), this study quantitatively confirms that Bottom-Up Tracking by Sorting (BoT-SORT) (1.88 s average tracklet lifetime) significantly outperforms ByteTrack (0.81 s) in identity preservation for visually similar macaques. System integration achieved 480 Frames Per Second (FPS) real-time inference on edge devices. A quadratic polynomial fitting model ensured high-precision aiming (RMSE < 2 pixels; best 1.2 pixels) by compensating for distortion. To fundamentally solve habituation, an adaptive strategy driven by a Deep Deterministic Policy Gradient (DDPG) framework was introduced. By using a habituation penalty term (R_habituation) to force unpredictable sequences, the DDPG strategy achieved a stable 88% average Intrusion Frequency Reduction Rate (IFRR) in field experiments, suppressing habituation in highly intelligent species. This study develops an efficient, precise, low-cost, and habituation-resistant automated wildlife defense system.
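The habituation penalty can be understood through a minimal reward sketch like the following, where a deterrence reward is offset by a similarity penalty against recently used action patterns. The similarity kernel, weighting, and history window are hypothetical choices, not the paper's R_habituation definition.

```python
import numpy as np

def reward(intrusion_reduced, action, recent_actions, lam=0.5):
    """Hypothetical habituation-penalized reward for the DDPG agent.

    intrusion_reduced: bool, whether the deterrence attempt succeeded.
    action: np.ndarray encoding the laser pattern (e.g., duration, sweep, color).
    recent_actions: list of recent action vectors kept in a sliding window.
    """
    r_deter = float(intrusion_reduced)
    if recent_actions:
        sims = [np.exp(-np.linalg.norm(action - a)) for a in recent_actions]
        r_habituation = max(sims)        # high when the pattern repeats recent ones
    else:
        r_habituation = 0.0
    return r_deter - lam * r_habituation  # unpredictability is rewarded
```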
Recognising human-object interactions (HOI) is a challenging task for traditional machine learning models, including convolutional neural networks (CNNs). Existing models show limited transferability across complex datasets such as D3D-HOI and SYSU 3D HOI. The conventional architecture of CNNs restricts their ability to handle HOI scenarios with high complexity, so HOI recognition requires improved feature extraction methods to overcome the current limitations in accuracy and scalability. This work proposes a novel quantum gate-enabled hybrid CNN (QEH-CNN) for effective HOI recognition. The model enhances CNN performance by integrating quantum computing components. The framework begins with bilateral image filtering, followed by multi-object tracking (MOT) and Felzenszwalb superpixel segmentation. A watershed algorithm refines object boundaries by cleaning merged superpixels. Feature extraction combines a histogram of oriented gradients (HOG), Global Image Statistics for Texture (GIST) descriptors, and a novel 23-joint keypoint extraction method using relative joint angles and joint proximity measures. A fuzzy optimization process refines the extracted features before feeding them into the QEH-CNN model. The proposed model achieves 95.06% accuracy on the D3D-HOI dataset and 97.29% on the SYSU 3D HOI dataset. The integration of quantum computing enhances feature optimization, leading to improved accuracy and overall model efficiency.
Sign language is a primary mode of communication for individuals with hearing impairments, conveying meaning through hand shapes and hand movements. Unlike spoken or written languages, sign language relies on the recognition and interpretation of hand gestures captured in video data. However, sign language datasets remain relatively limited compared to those of other languages, which hinders the training and performance of deep learning models. Additionally, the distinct word order of sign language, unlike that of spoken language, requires context-aware and natural sentence generation. To address these challenges, this study applies data augmentation techniques to build a Korean Sign Language dataset and train recognition models; recognized words are then reconstructed into complete sentences. The sign recognition process uses OpenCV and MediaPipe to extract hand landmarks from sign language videos and analyzes hand position, orientation, and motion. The extracted features are converted into time-series data and fed into a Long Short-Term Memory (LSTM) model. The proposed recognition framework achieved an accuracy of up to 81.25%, while the sentence generation achieved an accuracy of up to 95%. The proposed approach is expected to be applicable not only to Korean Sign Language but also to other low-resource sign languages for recognition and translation tasks.
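A compact sketch of the landmark-extraction stage might look like the following: MediaPipe Hands returns 21 normalized (x, y, z) landmarks per detected hand per frame, which are flattened into a fixed-length vector and stacked into the time series the LSTM consumes. The two-hand layout and zero-padding for missed detections are assumptions.

```python
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)

def video_to_sequence(path):
    """Convert a sign video into a (frames, 126) landmark time series."""
    cap, frames = cv2.VideoCapture(path), []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        vec = np.zeros(2 * 21 * 3, dtype=np.float32)   # two hands x 21 joints x xyz
        if result.multi_hand_landmarks:
            for h, lm in enumerate(result.multi_hand_landmarks[:2]):
                pts = [(p.x, p.y, p.z) for p in lm.landmark]
                vec[h * 63:(h + 1) * 63] = np.asarray(pts).ravel()
        frames.append(vec)                              # zeros if no hand detected
    cap.release()
    return np.stack(frames)
```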
Scene recognition is a critical component of computer vision, powering applications from autonomous vehicles to surveillance systems. However, its development is often constrained by a heavy reliance on large, expensively annotated datasets. This research presents a novel, efficient approach that leverages multi-model transfer learning from pre-trained deep neural networks, specifically DenseNet201 and the Visual Geometry Group network (VGG), to overcome this limitation. Our method significantly reduces dependency on vast labeled data while achieving high accuracy. Evaluated on the Aerial Image Dataset (AID), the model attained a validation accuracy of 93.6% with a loss of 0.35, demonstrating robust performance with minimal training data. These results underscore the viability of our approach for real-time, data-efficient scene recognition, offering a practical and cost-effective advancement for the field.
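One plausible shape of such multi-model transfer learning is sketched below in PyTorch: frozen ImageNet backbones supply complementary features that a small trainable head classifies over AID's 30 scene classes. Whether the paper concatenates features or averages predictions is not stated, so the concatenation here is an assumption.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiBackbone(nn.Module):
    """Sketch: frozen DenseNet201 + VGG16 features, one trainable linear head."""
    def __init__(self, num_classes=30):
        super().__init__()
        self.densenet = models.densenet201(weights="IMAGENET1K_V1").features
        self.vgg = models.vgg16(weights="IMAGENET1K_V1").features
        for p in list(self.densenet.parameters()) + list(self.vgg.parameters()):
            p.requires_grad = False                      # keep backbones frozen
        # 1920 and 512 are the channel counts of the two feature extractors.
        self.head = nn.Linear(1920 + 512, num_classes)

    def forward(self, x):
        f1 = self.densenet(x).mean(dim=(2, 3))          # global-average pooled
        f2 = self.vgg(x).mean(dim=(2, 3))
        return self.head(torch.cat([f1, f2], dim=1))
```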
Objective To develop a depression recognition model by integrating the spirit-expression diagnostic framework of traditional Chinese medicine (TCM) with machine learning algorithms. The proposed model seeks to establish a TCM-informed tool for early depression screening, thereby bridging traditional diagnostic principles with modern computational approaches. Methods The study included patients with depression who visited the Shanghai Pudong New Area Mental Health Center from October 1, 2022 to October 1, 2023, as well as students and teachers from Shanghai University of Traditional Chinese Medicine during the same period as the healthy control group. Videos of 3–10 s were captured using a Xiaomi Pad 5, and the TCM spirit and expressions were determined by TCM experts (a category was assigned when at least 3 of 5 experts agreed). Basic information, facial images, and interview information were collected through a portable TCM intelligent analysis and diagnosis device, and facial diagnosis features were extracted using the OpenCV computer vision library. Statistical analysis methods such as parametric and non-parametric tests were used to analyze the baseline data, TCM spirit and expression features, and facial diagnosis feature parameters of the two groups, to compare the differences in TCM spirit and expression and facial features. Five machine learning algorithms, including extreme gradient boosting (XGBoost), decision tree (DT), Bernoulli naive Bayes (BernoulliNB), support vector machine (SVM), and k-nearest neighbor (KNN) classification, were used to construct a depression recognition model based on the fusion of TCM spirit and expression features. The performance of the model was evaluated using metrics such as accuracy, precision, and the area under the receiver operating characteristic (ROC) curve (AUC). The model results were explained using Shapley Additive exPlanations (SHAP). Results A total of 93 depression patients and 87 healthy individuals were ultimately included in this study. There was no statistically significant difference in the baseline characteristics between the two groups (P > 0.05). The differences in the TCM spirit and expression characteristics and facial features between the two groups were as follows. (i) Quantispirit facial analysis revealed that depression patients exhibited significantly reduced facial spirit and luminance compared with healthy controls (P < 0.05), with characteristic features such as sad expressions, facial erythema, and lip color changes ranging from erythematous to cyanotic. (ii) Depressed patients exhibited significantly lower values in facial complexion L, lip L and a values, and gloss index, but higher values in facial complexion a and b, lip b, low gloss index, and matte index (all P < 0.05). (iii) The multi-model results show that the XGBoost-based depression recognition model, integrating the TCM “spirit-expression” diagnostic framework, achieved an accuracy of 98.61% and significantly outperformed the four benchmark algorithms DT, BernoulliNB, SVM, and KNN (P < 0.01). (iv) The SHAP visualization results show that in the recognition model constructed by the XGBoost algorithm, the complexion b value, categories of facial spirit, high gloss index, low gloss index, categories of facial expression, and texture features contribute significantly to the model. Conclusion This study demonstrates that integrating TCM spirit-expression diagnostic features with machine learning enables the construction of a high-precision depression detection model, offering a novel paradigm for objective depression diagnosis.
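For readers wanting the modeling skeleton, the sketch below pairs an XGBoost classifier with SHAP's TreeExplainer on a stand-in feature table. The synthetic data, hyperparameters, and 12-feature layout are placeholders, not the study's fused spirit-expression features.

```python
import numpy as np
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split

# Stand-in for the 180-subject fused feature table
# (complexion b, gloss indices, spirit/expression categories, ...).
rng = np.random.default_rng(0)
X = rng.random((180, 12))
y = (X[:, 0] + 0.2 * rng.random(180) > 0.6).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))

explainer = shap.TreeExplainer(model)               # per-feature contributions
shap.summary_plot(explainer.shap_values(X_te), X_te)
```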
Video emotion recognition is widely used due to its alignment with the temporal characteristics of human emotional expression, but existing models have significant shortcomings. On the one hand, Transformer multi-head self-attention modeling of global temporal dependencies suffers from high computational overhead and feature similarity. On the other hand, fixed-size convolution kernels are often used, which perceive emotional regions of different scales poorly. Therefore, this paper proposes a video emotion recognition model that combines multi-scale region-aware convolution with temporal interactive sampling. Spatially, multi-branch large-kernel stripe convolution is used to perceive emotional region features at different scales, and attention weights are generated for each scale's features. Temporally, multi-layer odd-even down-sampling is performed on the time series, and odd-even sub-sequence interaction is performed to address feature similarity, while computational costs are reduced because sampling and convolution overhead scale linearly. The model was tested on CMU-MOSI, CMU-MOSEI, and Hume-Reaction, reaching an Acc-2 of 83.4%, 85.2%, and 81.2%, respectively. The experimental results show that the model can significantly improve the accuracy of emotion recognition.
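The odd-even interactive sampling can be pictured with a one-level sketch like the following: the sequence is halved into even- and odd-indexed sub-sequences that modulate each other through learned transforms, so each level processes roughly half the previous one. The tanh-gated interaction is an assumed form for illustration, not the paper's exact operator.

```python
import torch

def odd_even_interact(x, f_even, f_odd):
    """One level of odd-even down-sampling with cross interaction.

    x: (batch, time, dim) with even time length (an assumption for simplicity).
    f_even, f_odd: learnable modules mapping (B, T/2, D) -> (B, T/2, D).
    """
    even, odd = x[:, 0::2], x[:, 1::2]        # split into two half-rate streams
    even = even * torch.tanh(f_odd(odd))      # odd branch modulates even branch
    odd = odd * torch.tanh(f_even(even))      # and vice versa
    return even, odd                           # feed each into the next level
```

Stacking several such levels yields the multi-layer down-sampling described above.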
Gait recognition is a key biometric for long-distance identification, yet its performance is severely degraded by real-world challenges such as varying clothing, carrying conditions, and changing viewpoints. While combining silhouette and skeleton data is a promising direction, effectively fusing these heterogeneous modalities and adaptively weighting their contributions in response to diverse conditions remains a central problem. This paper introduces GaitMAFF, a novel Multi-modal Adaptive Feature Fusion Network, to address this challenge. Our approach first transforms discrete skeleton joints into a dense SkeletonMap representation to align with silhouettes, then employs an attention-based module to dynamically learn the fusion weights between the two modalities. These fused features are processed by a powerful spatio-temporal backbone with Weighted Global-Local Feature Fusion Modules (WFFM) to learn a discriminative representation. Extensive experiments on the challenging CCPG and Gait3D datasets show that GaitMAFF achieves state-of-the-art performance, with an average Rank-1 accuracy of 84.6% on CCPG and 58.7% on Gait3D. These results demonstrate that our adaptive fusion strategy effectively integrates complementary multimodal information, significantly enhancing gait recognition robustness and accuracy in complex scenes and providing a practical solution for real-world applications.
Person recognition in photo collections is a critical yet challenging task in computer vision. Previous studies have used social relationships within photo collections to address this issue. However, these methods often fail when recognizing a single person in a photo collection, as they cannot rely on social connections for recognition. In this work, we discard social relationships and instead measure the relationships between photos to solve this problem. We designed a new model that includes a multi-parameter attention network for adaptively fusing visual features and a unified formula for measuring photo intimacy. This model effectively recognizes individuals in a single photo within the collection. Due to outdated annotations and missing photos in the existing PIPA (Person in Photo Album) dataset, we manually re-annotated it and added approximately ten thousand photos of Asian individuals to address the underrepresentation issue. Our results on the re-annotated PIPA dataset are superior to previous studies in most cases, and experiments on the supplemented dataset further demonstrate the effectiveness of our method. We have made the PIPA dataset publicly available on Zenodo, with the DOI: 10.5281/zenodo.12508096 (accessed on 15 October 2025).
Human Activity Recognition (HAR) is a novel area of computer vision with great impact on healthcare, smart environments, and surveillance, as it can automatically detect human behavior. It plays a vital role in many applications, such as smart homes, healthcare, human-computer interaction, sports analysis, and especially intelligent surveillance. In this paper, we propose a robust and efficient HAR system by leveraging deep learning paradigms, including pre-trained models, CNN architectures, and their average-weighted fusion. However, due to the diversity of human actions and various environmental influences, as well as a lack of data and resources, achieving high recognition accuracy remains elusive. In this work, a weighted average ensemble technique is employed to fuse three deep learning models: EfficientNet, ResNet50, and a custom CNN. The results of this study indicate that a weighted average ensemble strategy is a promising way to build more effective HAR models for the detection and classification of human activities. Experiments on a benchmark dataset showed that the proposed weighted ensemble approach outperformed existing approaches in terms of accuracy and other key performance measures. The combined average-weighted ensemble of pre-trained and CNN models obtained an accuracy of 98%, compared to 97%, 96%, and 95% for the customized CNN, EfficientNet, and ResNet50 models, respectively.
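The fusion step itself reduces to a few lines; below is a NumPy sketch in which per-model class probabilities are mixed with fixed weights. The weight values shown are illustrative assumptions, as the paper's actual weights are not given.

```python
import numpy as np

def ensemble_predict(prob_efficientnet, prob_resnet50, prob_custom_cnn,
                     weights=(0.35, 0.30, 0.35)):
    """Weighted-average fusion of three models' softmax outputs.

    Each prob_* argument has shape (num_samples, num_classes); the weights
    should sum to 1 and would normally be tuned on a validation set.
    """
    probs = np.stack([prob_efficientnet, prob_resnet50, prob_custom_cnn])
    fused = np.tensordot(np.asarray(weights), probs, axes=1)  # (N, num_classes)
    return fused.argmax(axis=1)                                # predicted labels
```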
This study presents a hybrid CNN-Transformer model for real-time recognition of affective tactile biosignals. The proposed framework combines convolutional neural networks (CNNs) to extract spatial and local temporal features with a Transformer encoder that captures long-range dependencies in time-series data through multi-head attention. Model performance was evaluated on two widely used tactile biosignal datasets, HAART and CoST, which contain diverse affective touch gestures recorded from pressure sensor arrays. The CNN-Transformer model achieved recognition rates of 93.33% on HAART and 80.89% on CoST, outperforming existing methods on both benchmarks. By incorporating temporal windowing, the model enables instantaneous prediction, improving generalization across gestures of varying duration. These results highlight the effectiveness of deep learning for tactile biosignal processing and demonstrate the potential of the CNN-Transformer approach for future applications in wearable sensors, affective computing, and biomedical monitoring.
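A minimal PyTorch sketch of such a hybrid follows: a 1D convolutional front-end extracts local temporal features from windowed pressure frames, and a Transformer encoder attends across the window. Channel counts, depth, and head count are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CNNTransformer(nn.Module):
    """Sketch: CNN for local features, Transformer encoder for long-range context."""
    def __init__(self, in_ch=64, d_model=128, num_classes=7):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_ch, d_model, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU())
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (batch, time, sensors) -- one temporal window of flattened taxel frames
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # local temporal features
        h = self.encoder(h)                               # long-range attention
        return self.head(h.mean(dim=1))                   # per-window gesture logits
```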
The initial noise present in depth images obtained with RGB-D sensors is a combination of hardware limitations and environmental factors; the limited capabilities of the sensors also degrade downstream computer vision results. Common image denoising techniques based on spatial and frequency filtering tend to remove significant image details along with the noise. The framework presented in this paper is a novel denoising model that makes use of Boruta-driven feature selection together with a Long Short-Term Memory Autoencoder (LSTMAE). The Boruta algorithm identifies the most useful depth features, which are used to maximize spatial structure integrity and reduce redundancy. An LSTMAE then processes these selected features and models depth pixel sequences to generate robust, noise-resistant representations. The system uses the encoder to compress the input data into a latent space, which is then decoded to retrieve the clean image. Experiments on a benchmark dataset show that the proposed technique attains a PSNR of 45 dB and an SSIM of 0.90, 10 dB higher than conventional convolutional autoencoders and 15 dB higher than wavelet-based models. Moreover, the feature selection step decreases the input dimensionality by 40%, resulting in a 37.5% reduction in training time and a real-time inference rate of 200 FPS. The Boruta-LSTMAE framework therefore offers a highly efficient and scalable system for depth image denoising, with high potential for close-range 3D applications such as robotic manipulation and gesture-based interfaces.
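The autoencoder stage can be sketched as follows: an LSTM encoder compresses a sequence of (Boruta-selected) depth features into a latent state, which an LSTM decoder unrolls back into a denoised sequence. Feature and latent sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMAE(nn.Module):
    """Sketch of an LSTM autoencoder over depth-pixel feature sequences."""
    def __init__(self, n_features=96, latent=32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, latent, batch_first=True)
        self.decoder = nn.LSTM(latent, latent, batch_first=True)
        self.out = nn.Linear(latent, n_features)

    def forward(self, x):
        # x: (batch, seq_len, n_features) -- Boruta-selected depth features per step
        _, (h, _) = self.encoder(x)                      # compress to final hidden state
        z = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)   # repeat latent for each step
        y, _ = self.decoder(z)
        return self.out(y)                               # reconstructed (denoised) sequence
```

Training would minimize a reconstruction loss (e.g., MSE) between the output and clean reference sequences.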
基金supported,in part,by the National Nature Science Foundation of China under Grant 62272236,62376128 and 62306139the Natural Science Foundation of Jiangsu Province under Grant BK20201136,BK20191401.
文摘Discriminative region localization and efficient feature encoding are crucial for fine-grained object recognition.However,existing data augmentation methods struggle to accurately locate discriminative regions in complex backgrounds,small target objects,and limited training data,leading to poor recognition.Fine-grained images exhibit“small inter-class differences,”and while second-order feature encoding enhances discrimination,it often requires dual Convolutional Neural Networks(CNN),increasing training time and complexity.This study proposes a model integrating discriminative region localization and efficient second-order feature encoding.By ranking feature map channels via a fully connected layer,it selects high-importance channels to generate an enhanced map,accurately locating discriminative regions.Cropping and erasing augmentations further refine recognition.To improve efficiency,a novel second-order feature encoding module generates an attention map from the fourth convolutional group of Residual Network 50 layers(ResNet-50)and multiplies it with features from the fifth group,producing second-order features while reducing dimensionality and training time.Experiments on Caltech-University of California,San Diego Birds-200-2011(CUB-200-2011),Stanford Car,and Fine-Grained Visual Classification of Aircraft(FGVC Aircraft)datasets show state-of-the-art accuracy of 88.9%,94.7%,and 93.3%,respectively.
基金supported by the National Natural Science Foundation of China(No.62302540),with author Fangfang Shan for more information,please visit their website at https://www.nsfc.gov.cn/(accessed on 05 June 2024)Additionally,it is also funded by the Open Foundation of Henan Key Laboratory of Cyberspace Situation Awareness(No.HNTS2022020),where Fangfang Shan is an author.Further details can be found at http://xt.hnkjt.gov.cn/data/pingtai/(accessed on 05 June 2024)the Natural Science Foundation of Henan Province Youth Science Fund Project(No.232300420422),and for more information,you can visit https://kjt.henan.gov.cn(accessed on 05 June 2024).
文摘With the emergence and development of social networks,people can stay in touch with friends,family,and colleagues more quickly and conveniently,regardless of their location.This ubiquitous digital internet environment has also led to large-scale disclosure of personal privacy.Due to the complexity and subtlety of sensitive information,traditional sensitive information identification technologies cannot thoroughly address the characteristics of each piece of data,thus weakening the deep connections between text and images.In this context,this paper adopts the CLIP model as a modality discriminator.By using comparative learning between sensitive image descriptions and images,the similarity between the images and the sensitive descriptions is obtained to determine whether the images contain sensitive information.This provides the basis for identifying sensitive information using different modalities.Specifically,if the original data does not contain sensitive information,only single-modality text-sensitive information identification is performed;if the original data contains sensitive information,multimodality sensitive information identification is conducted.This approach allows for differentiated processing of each piece of data,thereby achieving more accurate sensitive information identification.The aforementioned modality discriminator can address the limitations of existing sensitive information identification technologies,making the identification of sensitive information from the original data more appropriate and precise.
基金funded by the Guangzhou Development Zone Science and Technology Project(2023GH02)the University of Macao(MYRG2022-00271-FST)research grants by the Science and Technology Development Fund of Macao(0032/2022/A)and(0019/2025/RIB1).
文摘Accurately recognizing driver distraction is critical for preventing traffic accidents,yet current detection models face two persistent challenges.First,distractions are often fine-grained,involving subtle cues such as brief eye closures or partial yawns,which are easily missed by conventional detectors.Second,in real-world scenarios,drivers frequently exhibit overlapping behaviors,such as simultaneously holding a cup,closing their eyes,and yawning,leading tomultiple detection boxes and degradedmodel performance.Existing approaches fail to robustly address these complexities,resulting in limited reliability in safety critical applications.To overcome these pain points,we propose YOLO-Drive,a novel framework that enhances YOLO-based driver monitoring with EfficientViM and Polarized Spectral–Spatial Attention(PSSA)modules.Efficient ViMprovides lightweight yet powerful global–local feature extraction,enabling accurate recognition of subtle driver states.PSSA further amplifies discriminative features across spatial and spectral domains,ensuring robust separation of concurrent distraction cues.By explicitly modeling fine-grained and overlapping behaviors,our approach delivers significant improvements in both precision and robustness.Extensive experiments on benchmark driver distraction datasets demonstrate that YOLO-Drive consistently out-performs stateof-the-art models,achieving higher detection accuracy while maintaining real-time efficiency.These results validate YOLO-Drive as a practical and reliable solution for advanced driver monitoring systems,addressing long-standing challenges of subtle cue recognition and multi-cue distraction detection.
基金supported by the National Natural Science Foundation of China,China (Grants No.62171232)the Priority Academic Program Development of Jiangsu Higher Education Institutions,China。
文摘Fine-grained Image Recognition(FGIR)task is dedicated to distinguishing similar sub-categories that belong to the same super-category,such as bird species and car types.In order to highlight visual differences,existing FGIR works often follow two steps:discriminative sub-region localization and local feature representation.However,these works pay less attention on global context information.They neglect a fact that the subtle visual difference in challenging scenarios can be highlighted through exploiting the spatial relationship among different subregions from a global view point.Therefore,in this paper,we consider both global and local information for FGIR,and propose a collaborative teacher-student strategy to reinforce and unity the two types of information.Our framework is implemented mainly by convolutional neural network,referred to Teacher-Student Based Attention Convolutional Neural Network(T-S-ACNN).For fine-grained local information,we choose the classic Multi-Attention Network(MA-Net)as our baseline,and propose a type of boundary constraint to further reduce background noises in the local attention maps.In this way,the discriminative sub-regions tend to appear in the area occupied by fine-grained objects,leading to more accurate sub-region localization.For fine-grained global information,we design a graph convolution based Global Attention Network(GA-Net),which can combine extracted local attention maps from MA-Net with non-local techniques to explore spatial relationship among subregions.At last,we develop a collaborative teacher-student strategy to adaptively determine the attended roles and optimization modes,so as to enhance the cooperative reinforcement of MA-Net and GA-Net.Extensive experiments on CUB-200-2011,Stanford Cars and FGVC Aircraft datasets illustrate the promising performance of our framework.
文摘Fine-grained recognition of ships based on remote sensing images is crucial to safeguarding maritime rights and interests and maintaining national security.Currently,with the emergence of massive high-resolution multi-modality images,the use of multi-modality images for fine-grained recognition has become a promising technology.Fine-grained recognition of multi-modality images imposes higher requirements on the dataset samples.The key to the problem is how to extract and fuse the complementary features of multi-modality images to obtain more discriminative fusion features.The attention mechanism helps the model to pinpoint the key information in the image,resulting in a significant improvement in the model’s performance.In this paper,a dataset for fine-grained recognition of ships based on visible and near-infrared multi-modality remote sensing images has been proposed first,named Dataset for Multimodal Fine-grained Recognition of Ships(DMFGRS).It includes 1,635 pairs of visible and near-infrared remote sensing images divided into 20 categories,collated from digital orthophotos model provided by commercial remote sensing satellites.DMFGRS provides two types of annotation format files,as well as segmentation mask images corresponding to the ship targets.Then,a Multimodal Information Cross-Enhancement Network(MICE-Net)fusing features of visible and near-infrared remote sensing images,has been proposed.In the network,a dual-branch feature extraction and fusion module has been designed to obtain more expressive features.The Feature Cross Enhancement Module(FCEM)achieves the fusion enhancement of the two modal features by making the channel attention and spatial attention work cross-functionally on the feature map.A benchmark is established by evaluating state-of-the-art object recognition algorithms on DMFGRS.MICE-Net conducted experiments on DMFGRS,and the precision,recall,mAP0.5 and mAP0.5:0.95 reached 87%,77.1%,83.8%and 63.9%,respectively.Extensive experiments demonstrate that the proposed MICE-Net has more excellent performance on DMFGRS.Built on lightweight network YOLO,the model has excellent generalizability,and thus has good potential for application in real-life scenarios.
文摘Behavior recognition of Hu sheep contributes to their intensive and intelligent farming.Due to the generally high density of Hu sheep farming,severe occlusion occurs among different behaviors and even among sheep performing the same behavior,leading to missing and false detection issues in existing behavior recognition methods.A high-low frequency aggregated attention and negative sample comprehensive score loss and comprehensive score soft non-maximum suppression-YOLO(HLNC-YOLO)was proposed for identifying the behavior of Hu sheep,addressing the issues of missed and erroneous detections caused by occlusion between Hu sheep in intensive farming.Firstly,images of four typical behaviors-standing,lying,eating,and drinking-were collected from the sheep farm to construct the Hu sheep behavior dataset(HSBD).Next,to solve the occlusion issues,during the training phase,the C2F-HLAtt module was integrated,which combined high-low frequency aggregation attention,into the YOLO v8 Backbone to perceive occluded objects and introduce an auxiliary reversible branch to retain more effective features.Using comprehensive score regression loss(CSLoss)to reduce the scores of suboptimal boxes and enhance the comprehensive scores of occluded object boxes.Finally,the soft comprehensive score non-maximal suppression(Soft-CS-NMS)algorithm filtered prediction boxes during the inferencing.Testing on the HSBD,HLNC-YOLO achieved a mean average precision(mAP@50)of 87.8%,with a memory footprint of 17.4 MB.This represented an improvement of 7.1,2.2,4.6,and 11 percentage points over YOLO v8,YOLO v9,YOLO v10,and Faster R-CNN,respectively.Research indicated that the HLNC-YOLO accurately identified the behavior of Hu sheep in intensive farming and possessed generalization capabilities,providing technical support for smart farming.
基金Supported by the Integrated Project of National Natural Science Foundation and Enterprise Innovation Development Joint Foundation(U24B6004)。
文摘Through tracing the background and customary usage of classification of fine-grained sedimentary rocks and terminology,and comparing current“sedimentary petrology”textbooks and monographs,this paper proposes a classification scheme for fine-grained sedimentary rocks and clarifies related terminology.The comprehensive analysis indicates that the classification of clastic rocks,volcanic clastic rocks,chemical rocks,and biogenic(carbonate)rocks is unified,and the definitions of terms such as lamination,bedding and beds are consistent.However,there is a disagreement on the definition of“mud”.European and American scholars commonly use the term“mud”to include silt and clay(particle size less than 0.0625 mm).Chinese scholars equate the term“mud”to“clay”(particle size less than 0.0039 mm or less than 0.01 mm).Combined with the discussion on terms such as sedimentary structures(bedding,lamination and lamellation),shale,mudstone,mudrocks/argillaceous rocks and mud shale,it is recommended to use“fine-grained sedimentary rocks”as the general term for all sedimentary rocks composed of fine-grained materials with particle size less than 0.0625 mm,including claystone/mudrocks and siltstone.Claystone/mudrocks are further classified into argillaceous(or clayey)mudstone/shale,calcareous mudstone/shale,siliceous mudstone/shale,silty mudstone/shale and silt-containing mudstone/shale.Argillaceous(or clayey)mudstone/shale emphasizes a content of clay minerals or clay-sized particles exceeding 50%.Other mudstones/shales emphasize a content of particles(particle size less than 0.0625 mm)exceeding 50%.The commonly referred term“shale”should not include siltstone.It is necessary to establish a reasonable,standardized,and applicable classification scheme for fine-grained sedimentary rocks in the future.An integrated shale microfacies research at the thin-section scale should be carried out,and combined with well logging data interpretation and seismic attribute analysis,a geological model of lithology/lithofacies will be iteratively upgraded to accurately determine sweet layer,locate target layer,and evaluate favorable area.
基金the funds of the National Natural Science Fund for Excellent Young Scholars of China(No.52222410)Shaanxi Province National Science Fund for Distinguished Young Scholars,China(No.2022JC-24)the National Natural Science Foundation of China(Nos.52227807,52034005)。
文摘A fine-grained metastable dual-phase Fe_(40)Mn_(20)Co_(20)Cr_(15)Si_(5)high entropy alloy(CS-HEA)with excellent strength and ductility was successfully prepared by friction stir processing(FSP).The microstructural and mechanical properties of the fine-grained CS-HEA were characterized.The results showed that as-cast shrinkage cavities and elemental segregation were eliminated.The average grain size was refined from 121.1 to 5.4μm.The face-centered cubic phase fraction increased from 23%to 82%.During tensile deformation,dislocation slip dominated at strains ranging from 5%to 17%,followed by transformation induced plasticity(TRIP)from 17%to 26%,and twin induced plasticity(TWIP)from 26%to 37%.The yield strength,ultimate tensile strength,and elongation of the fine-grained CS-HEA were 503 MPa,1120 MPa,and 37%,respectively.The strength-ductility synergy of fine-grained CS-HEA was attributed to the combined effects of TRIP,TWIP,dislocation strengthening,and fine-grained strengthening.
基金supported in part by the National Natural Science Foundation of China:61773330.
文摘Audio-visual speech recognition(AVSR),which integrates audio and visual modalities to improve recognition performance and robustness in noisy or adverse acoustic conditions,has attracted significant research interest.However,Conformer-based architectures remain computational expensive due to the quadratic increase in the spatial and temporal complexity of their softmax-based attention mechanisms with sequence length.In addition,Conformerbased architectures may not provide sufficient flexibility for modeling local dependencies at different granularities.To mitigate these limitations,this study introduces a novel AVSR framework based on a ReLU-based Sparse and Grouped Conformer(RSG-Conformer)architecture.Specifically,we propose a Global-enhanced Sparse Attention(GSA)module incorporating an efficient context restoration block to recover lost contextual cues.Concurrently,a Grouped-scale Convolution(GSC)module replaces the standard Conformer convolution module,providing adaptive local modeling across varying temporal resolutions.Furthermore,we integrate a Refined Intermediate Contextual CTC(RIC-CTC)supervision strategy.This approach applies progressively increasing loss weights combined with convolution-based context aggregation,thereby further relaxing the constraint of conditional independence inherent in standard CTC frameworks.Evaluations on the LRS2 and LRS3 benchmark validate the efficacy of our approach,with word error rates(WERs)reduced to 1.8%and 1.5%,respectively.These results further demonstrate and validate its state-of-the-art performance in AVSR tasks.
基金Part of the research funding was provided by Tatung University.
文摘To address crop depredation by intelligent species(e.t,macaques)and the habituation from traditional methods,this study proposes an intelligent,closed-loop,adaptive laser deterrence system.A core contribution is an efficient multi-stage Semi-Supervised Learning(SSL)and incremental fine-tuning(IFT)framework,which reduced manual annotation by~60%and training time by~68%.This framework was benchmarked against YOLOv8n,v10n,and v11n.Our analysis revealed that YOLOv12n’s high Signal-to-Noise Ratio(SNR)(47.1%retention)pseudo-labels made it the onlymodel to gain performance(+0.010mAP)fromSSL,allowing it to overtake competitors.Subsequently,in the IFT stress test,YOLOv12n proved most robust(a minimal−0.019 mAP decline),whereas YOLOv10n suffered catastrophic failure(−0.233mAP),highlighting its incompatibility with IFT.Thefinalmodel achieved high performance(mAP@0.5 of 0.947 for macaques,0.946 for laser spots).In Multi-Object Tracking(MOT),this study quantitatively confirms that Bottom-Up Tracking by Sorting(BoT-SORT)(1.88 s avg.tracklet lifetime)significantly outperforms ByteTrack(0.81 s)in identity preservation for visually similar macaques.System integration achieved 480 Frames Per Second(FPS)real-time inference on edge devices.A quadratic polynomial fittingmodel ensured high-precision aiming(RMSE<2 pixels;best 1.2 pixels)by compensating for distortion.To fundamentally solve habituation,an adaptive strategy driven by a Deep Deterministic Policy Gradient(DDPG)framework was introduced.By using a habituation penalty term(Rhabituation)to force unpredictable sequences,theDDPGstrategy achieved a stable 88%average Intrusion Frequency Reduction Rate(IFRR)in field experiments,suppressing habituation in highly intelligent species.This study develops an efficient,precise,low-cost,and habituation-resistant automated wildlife defense system.
基金supported and funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2025R410),Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.
文摘Recognising human-object interactions(HOI)is a challenging task for traditional machine learning models,including convolutional neural networks(CNNs).Existing models show limited transferability across complex datasets such as D3D-HOI and SYSU 3D HOI.The conventional architecture of CNNs restricts their ability to handle HOI scenarios with high complexity.HOI recognition requires improved feature extraction methods to overcome the current limitations in accuracy and scalability.This work proposes a Novel quantum gate-enabled hybrid CNN(QEH-CNN)for effectiveHOI recognition.Themodel enhancesCNNperformance by integrating quantumcomputing components.The framework begins with bilateral image filtering,followed bymulti-object tracking(MOT)and Felzenszwalb superpixel segmentation.A watershed algorithm refines object boundaries by cleaning merged superpixels.Feature extraction combines a histogram of oriented gradients(HOG),Global Image Statistics for Texture(GIST)descriptors,and a novel 23-joint keypoint extractionmethod using relative joint angles and joint proximitymeasures.A fuzzy optimization process refines the extracted features before feeding them into the QEH-CNNmodel.The proposed model achieves 95.06%accuracy on the 3D-D3D-HOI dataset and 97.29%on the SYSU3DHOI dataset.Theintegration of quantum computing enhances feature optimization,leading to improved accuracy and overall model efficiency.
Funding: Supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) - Innovative Human Resource Development for Local Intellectualization Program grant funded by the Korea government (MSIT) (IITP-2026-RS-2022-00156334, 50%) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1C1C2011105, 50%).
Abstract: Sign language is a primary mode of communication for individuals with hearing impairments, conveying meaning through hand shapes and hand movements. In contrast to spoken or written languages, sign language relies on the recognition and interpretation of hand gestures captured in video data. However, sign language datasets remain relatively limited compared to those of other languages, which hinders the training and performance of deep learning models. Additionally, the distinct word order of sign language, unlike that of spoken language, requires context-aware and natural sentence generation. To address these challenges, this study applies data augmentation techniques to build a Korean Sign Language dataset and train recognition models. Recognized words are then reconstructed into complete sentences. The sign recognition process uses OpenCV and MediaPipe to extract hand landmarks from sign language videos and analyzes hand position, orientation, and motion. The extracted features are converted into time-series data and fed into a Long Short-Term Memory (LSTM) model. The proposed recognition framework achieved an accuracy of up to 81.25%, while sentence generation achieved an accuracy of up to 95%. The proposed approach is expected to be applicable not only to Korean Sign Language but also to other low-resource sign languages for recognition and translation tasks.
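A minimal sketch of this pipeline is shown below: MediaPipe hand landmarks are flattened into per-frame feature vectors and classified by an LSTM over a sliding window. The window length (30 frames), two-hand padding, and class count are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: MediaPipe hand landmarks -> time-series features -> LSTM classifier.
import cv2
import mediapipe as mp
import numpy as np
import tensorflow as tf

hands = mp.solutions.hands.Hands(max_num_hands=2)

def frame_features(frame_bgr):
    """21 (x, y, z) landmarks per hand, zero-padded to two hands -> 126 values."""
    result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    feats = np.zeros(2 * 21 * 3, dtype=np.float32)
    if result.multi_hand_landmarks:
        for h, hand in enumerate(result.multi_hand_landmarks[:2]):
            pts = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
            feats[h * 63:(h + 1) * 63] = np.ravel(pts)
    return feats

# LSTM over a 30-frame window of landmark features.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 126)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(50, activation="softmax"),  # e.g., 50 sign classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```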
Funding: Funded by the Deanship of Scientific Research and Libraries, Princess Nourah bint Abdulrahman University, through the Program of Research Project Funding After Publication, grant No. (RPFAP-23-1445).
Abstract: Scene recognition is a critical component of computer vision, powering applications from autonomous vehicles to surveillance systems. However, its development is often constrained by a heavy reliance on large, expensively annotated datasets. This research presents a novel, efficient approach that leverages multi-model transfer learning from pre-trained deep neural networks, specifically DenseNet201 and the Visual Geometry Group (VGG) network, to overcome this limitation. Our method significantly reduces dependency on vast labeled data while achieving high accuracy. Evaluated on the Aerial Image Dataset (AID), the model attained a validation accuracy of 93.6% with a loss of 0.35, demonstrating robust performance with minimal training data. These results underscore the viability of our approach for real-time, data-efficient scene recognition, offering a practical and cost-effective advancement for the field.
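One plausible realization of this multi-model transfer learning is to run both frozen backbones on the same input and concatenate their pooled features; the abstract does not specify the fusion, the VGG variant, or the head, so those choices below are assumptions.

```python
# Sketch: frozen DenseNet201 + VGG16 features fused for scene classification.
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(224, 224, 3))
densenet = tf.keras.applications.DenseNet201(
    include_top=False, weights="imagenet", pooling="avg")
vgg = tf.keras.applications.VGG16(
    include_top=False, weights="imagenet", pooling="avg")
densenet.trainable = False
vgg.trainable = False

# Each backbone expects its own input preprocessing.
d = densenet(tf.keras.applications.densenet.preprocess_input(inputs))
v = vgg(tf.keras.applications.vgg16.preprocess_input(inputs))
x = tf.keras.layers.Concatenate()([d, v])
x = tf.keras.layers.Dense(256, activation="relu")(x)
outputs = tf.keras.layers.Dense(30, activation="softmax")(x)  # AID has 30 classes

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```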
Funding: General Program of the National Natural Science Foundation of China (82474390); Construction Project of Pudong New Area Famous TCM Studios (National Pilot Zone for TCM Development, Shanghai) (PDZY-2025-0716); Shanghai Municipal Science and Technology Program Project, Shanghai Key Laboratory of Health Identification and Assessment (21DZ2271000).
Abstract: Objective: To develop a depression recognition model by integrating the spirit-expression diagnostic framework of traditional Chinese medicine (TCM) with machine learning algorithms. The proposed model seeks to establish a TCM-informed tool for early depression screening, thereby bridging traditional diagnostic principles with modern computational approaches. Methods: The study included patients with depression who visited the Shanghai Pudong New Area Mental Health Center from October 1, 2022 to October 1, 2023, as well as students and teachers from Shanghai University of Traditional Chinese Medicine during the same period as the healthy control group. Videos of 3–10 s were captured using a Xiaomi Pad 5, and the TCM spirit and expressions were determined by TCM experts (at least 3 out of 5 experts had to agree on the category of TCM spirit and expression). Basic information, facial images, and interview information were collected through a portable TCM intelligent analysis and diagnosis device, and facial diagnosis features were extracted using the OpenCV computer vision library. Statistical methods such as parametric and non-parametric tests were used to analyze the baseline data, TCM spirit and expression features, and facial diagnosis feature parameters of the two groups, to compare the differences in TCM spirit and expression and facial features. Five machine learning algorithms, including extreme gradient boosting (XGBoost), decision tree (DT), Bernoulli naive Bayes (BernoulliNB), support vector machine (SVM), and k-nearest neighbor (KNN) classification, were used to construct a depression recognition model based on the fusion of TCM spirit and expression features. Model performance was evaluated using metrics such as accuracy, precision, and the area under the receiver operating characteristic (ROC) curve (AUC), and the model results were explained using Shapley Additive exPlanations (SHAP). Results: A total of 93 depression patients and 87 healthy individuals were ultimately included in this study. There was no statistically significant difference in the baseline characteristics between the two groups (P > 0.05). The differences in the TCM spirit and expression characteristics and facial features between the two groups were as follows. (i) Quantispirit facial analysis revealed that depression patients exhibited significantly reduced facial spirit and luminance compared with healthy controls (P < 0.05), with characteristic features such as sad expressions, facial erythema, and lip color changes ranging from erythematous to cyanotic. (ii) Depressed patients exhibited significantly lower values in facial complexion L, lip L and a values, and gloss index, but higher values in facial complexion a and b, lip b, low-gloss index, and matte index (all P < 0.05). (iii) Across the multiple models, the XGBoost-based depression recognition model integrating the TCM "spirit-expression" diagnostic framework achieved an accuracy of 98.61% and significantly outperformed the four benchmark algorithms, DT, BernoulliNB, SVM, and KNN (P < 0.01). (iv) The SHAP visualization shows that, in the recognition model constructed with the XGBoost algorithm, the complexion b value, categories of facial spirit, high-gloss index, low-gloss index, categories of facial expression, and texture features contribute significantly to the model. Conclusion: This study demonstrates that integrating TCM spirit-expression diagnostic features with machine learning enables the construction of a high-precision depression detection model, offering a novel paradigm for objective depression diagnosis.
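A schematic sketch of the XGBoost-plus-SHAP workflow is given below. The feature names follow the abstract, but the data is synthetic and the hyperparameters are assumptions; this is not the study's actual training setup.

```python
# Sketch: XGBoost depression classifier with SHAP feature attribution.
import numpy as np
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = ["complexion_b", "facial_spirit", "high_gloss", "low_gloss",
            "facial_expression", "texture"]
X = rng.normal(size=(180, len(features)))           # 93 patients + 87 controls
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=180) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = xgb.XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
model.fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))

# SHAP values quantify each feature's contribution to the prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
print("mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0))
```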
Funding: Supported, in part, by the National Natural Science Foundation of China under Grants 62272236 and 62376128, and, in part, by the Natural Science Foundation of Jiangsu Province under Grants BK20201136 and BK20191401.
Abstract: Video emotion recognition is widely used because it aligns with the temporal characteristics of human emotional expression, but existing models have significant shortcomings. On the one hand, Transformer multi-head self-attention modeling of global temporal dependencies suffers from high computational overhead and feature similarity. On the other hand, fixed-size convolution kernels are often used, which perceive emotional regions of different scales poorly. Therefore, this paper proposes a video emotion recognition model that combines multi-scale region-aware convolution with temporal interactive sampling. Spatially, multi-branch large-kernel stripe convolution is used to perceive emotional region features at different scales, and attention weights are generated for each scale's features. Temporally, multi-layer odd-even down-sampling is performed on the time series, and odd-even sub-sequence interaction is performed to alleviate feature similarity while reducing computational cost, owing to the linear relationship between sampling and convolution overhead. The model was tested on CMU-MOSI, CMU-MOSEI, and Hume Reaction, where Acc-2 reached 83.4%, 85.2%, and 81.2%, respectively. The experimental results show that the model significantly improves the accuracy of emotion recognition.
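The sketch below is one plausible reading of the multi-branch large-kernel stripe convolution with per-branch attention weights; the kernel sizes, branch count, and attention form are assumptions.

```python
# Sketch: multi-branch stripe convolution with per-branch attention weights.
import torch
import torch.nn as nn

class StripeConvBranch(nn.Module):
    """1 x k followed by k x 1 convolution approximates a large k x k kernel."""
    def __init__(self, channels, k):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
        )
    def forward(self, x):
        return self.conv(x)

class MultiScaleRegionAware(nn.Module):
    def __init__(self, channels, kernel_sizes=(7, 11, 21)):
        super().__init__()
        self.branches = nn.ModuleList(
            StripeConvBranch(channels, k) for k in kernel_sizes)
        # One attention weight per branch, generated from pooled features.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, len(kernel_sizes), 1),
            nn.Softmax(dim=1),
        )
    def forward(self, x):
        w = self.attn(x)                              # (B, num_branches, 1, 1)
        outs = [b(x) for b in self.branches]
        return sum(w[:, i:i + 1] * o for i, o in enumerate(outs))

x = torch.randn(2, 32, 56, 56)
print(MultiScaleRegionAware(32)(x).shape)             # torch.Size([2, 32, 56, 56])
```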
Funding: Funded by the Natural Science Foundation of Chongqing Municipality, grant number CSTB2022NSCQ-MSX0503.
Abstract: Gait recognition is a key biometric for long-distance identification, yet its performance is severely degraded by real-world challenges such as varying clothing, carrying conditions, and changing viewpoints. While combining silhouette and skeleton data is a promising direction, effectively fusing these heterogeneous modalities and adaptively weighting their contributions under diverse conditions remains a central problem. This paper introduces GaitMAFF, a novel Multi-modal Adaptive Feature Fusion Network, to address this challenge. Our approach first transforms discrete skeleton joints into a dense SkeletonMap representation to align with silhouettes, then employs an attention-based module to dynamically learn the fusion weights between the two modalities. The fused features are processed by a powerful spatio-temporal backbone with Weighted Global-Local Feature Fusion Modules (WFFM) to learn a discriminative representation. Extensive experiments on the challenging CCPG and Gait3D datasets show that GaitMAFF achieves state-of-the-art performance, with an average Rank-1 accuracy of 84.6% on CCPG and 58.7% on Gait3D. These results demonstrate that our adaptive fusion strategy effectively integrates complementary multimodal information, significantly enhancing gait recognition robustness and accuracy in complex scenes and providing a practical solution for real-world applications.
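As a sketch of attention-based adaptive fusion between the silhouette and skeleton-map streams, the gated form below learns a per-channel weight from both modalities; this is an assumed simplification, not GaitMAFF's exact module.

```python
# Sketch: gated adaptive fusion of silhouette and skeleton-map features.
import torch
import torch.nn as nn

class AdaptiveModalFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Predict a per-channel gate from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )
    def forward(self, silhouette_feat, skeleton_feat):
        g = self.gate(torch.cat([silhouette_feat, skeleton_feat], dim=1))
        # g weights the silhouette stream; (1 - g) weights the skeleton stream.
        return g * silhouette_feat + (1 - g) * skeleton_feat

sil = torch.randn(4, 64, 32, 22)
ske = torch.randn(4, 64, 32, 22)
print(AdaptiveModalFusion(64)(sil, ske).shape)   # torch.Size([4, 64, 32, 22])
```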
Funding: Supported by the Fundamental Research Funds for the Central Universities (Grant Nos. 3282025045 and 3282024008) and the Science and Technology Project of the State Archives Administration of China (Grant No. 2025-Z-009).
Abstract: Person recognition in photo collections is a critical yet challenging task in computer vision. Previous studies have used social relationships within photo collections to address this issue. However, these methods often fail when recognizing a single person in a photo within a collection, as they cannot rely on social connections for recognition. In this work, we discard social relationships and instead measure the relationships between photos to solve this problem. We designed a new model that includes a multi-parameter attention network for adaptively fusing visual features and a unified formula for measuring photo intimacy. This model effectively recognizes individuals in single photos within a collection. Because of outdated annotations and missing photos in the existing PIPA (Person in Photo Album) dataset, we manually re-annotated it and added approximately ten thousand photos of Asian individuals to address the underrepresentation issue. Our results on the re-annotated PIPA dataset are superior to previous studies in most cases, and experiments on the supplemented dataset further demonstrate the effectiveness of our method. We have made the PIPA dataset publicly available on Zenodo, with the DOI: 10.5281/zenodo.12508096 (accessed on 15 October 2025).
Funding: Supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R765), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Human Activity Recognition (HAR) is an active area of computer vision with great impact on healthcare, smart environments, and surveillance, as it can automatically detect human behavior. It plays a vital role in many applications, such as smart homes, healthcare, human-computer interaction, sports analysis, and, especially, intelligent surveillance. In this paper, we propose a robust and efficient HAR system by leveraging deep learning paradigms, including pre-trained models, CNN architectures, and their average-weighted fusion. However, because of the diversity of human actions, various environmental influences, and a lack of data and resources, high recognition accuracy remains elusive. In this work, a weighted-average ensemble technique is employed to fuse three deep learning models: EfficientNet, ResNet50, and a custom CNN. The results of this study indicate that a weighted-average ensemble strategy is a promising way to develop more effective HAR models for detecting and classifying human activities. Experiments on the benchmark dataset showed that the proposed weighted ensemble approach outperformed existing approaches in terms of accuracy and other key performance measures. The combined average-weighted ensemble of pre-trained and CNN models obtained an accuracy of 98%, compared with 97%, 96%, and 95% for the customized CNN, EfficientNet, and ResNet50 models, respectively.
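The weighted-average ensemble itself reduces to averaging the class-probability outputs of the three models with normalized weights, as in the minimal sketch below; the weights and class count shown are illustrative, not the paper's tuned values.

```python
# Sketch: weighted-average ensembling of softmax outputs from three models.
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Average class-probability matrices (n_samples x n_classes)."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return sum(w * p for w, p in zip(weights, prob_list))

# Stand-ins for predict() outputs of EfficientNet, ResNet50, and a custom CNN.
rng = np.random.default_rng(1)
probs = [rng.dirichlet(np.ones(7), size=5) for _ in range(3)]
fused = weighted_ensemble(probs, weights=[0.4, 0.3, 0.3])
print(fused.argmax(axis=1))                    # final activity predictions
```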
Abstract: This study presents a hybrid CNN-Transformer model for real-time recognition of affective tactile biosignals. The proposed framework combines convolutional neural networks (CNNs), which extract spatial and local temporal features, with a Transformer encoder that captures long-range dependencies in time-series data through multi-head attention. Model performance was evaluated on two widely used tactile biosignal datasets, HAART and CoST, which contain diverse affective touch gestures recorded from pressure sensor arrays. The CNN-Transformer model achieved recognition rates of 93.33% on HAART and 80.89% on CoST, outperforming existing methods on both benchmarks. By incorporating temporal windowing, the model enables instantaneous prediction and improves generalization across gestures of varying duration. These results highlight the effectiveness of deep learning for tactile biosignal processing and demonstrate the potential of the CNN-Transformer approach for future applications in wearable sensors, affective computing, and biomedical monitoring.
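A compact sketch of the hybrid design follows: 1-D convolutions extract local features from the sensor channels, and a Transformer encoder models long-range temporal dependencies. The layer sizes, sensor-channel count, and class count are assumptions for illustration.

```python
# Sketch: CNN front-end + Transformer encoder for tactile time series.
import torch
import torch.nn as nn

class CNNTransformer(nn.Module):
    def __init__(self, in_channels=64, d_model=128, num_classes=7):
        super().__init__()
        self.cnn = nn.Sequential(                     # local spatio-temporal features
            nn.Conv1d(in_channels, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                             # x: (batch, channels, time)
        z = self.cnn(x).transpose(1, 2)               # -> (batch, time/2, d_model)
        z = self.encoder(z)                           # multi-head self-attention
        return self.head(z.mean(dim=1))               # pool over time, classify

x = torch.randn(8, 64, 100)                           # e.g., 8x8 pressure grid, 100 frames
print(CNNTransformer()(x).shape)                       # torch.Size([8, 7])
```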
Abstract: The noise present in depth images obtained with RGB-D sensors stems from a combination of hardware limitations and environmental factors; given the limited capabilities of the sensors, it also degrades downstream computer vision results. Common image denoising techniques, being based on spatial- and frequency-domain filtering, tend to remove significant image detail along with the noise. The framework presented in this paper is a novel denoising model that makes use of Boruta-driven feature selection with a Long Short-Term Memory Autoencoder (LSTMAE). The Boruta algorithm identifies the most useful depth features, maximizing spatial structural integrity and reducing redundancy. An LSTMAE then processes these selected features, modeling depth pixel sequences to generate robust, noise-resistant representations. The system uses the encoder to compress the input data into a latent space before decoding it to recover the clean image. Experiments on a benchmark dataset show that the proposed technique attains a PSNR of 45 dB and an SSIM of 0.90, which is 10 dB higher than conventional convolutional autoencoders and 15 times higher than wavelet-based models. Moreover, the feature selection step decreases input dimensionality by 40%, resulting in a 37.5% reduction in training time and a real-time inference rate of 200 FPS. The Boruta-LSTMAE framework therefore offers a highly efficient and scalable system for depth image denoising, with high potential for close-range 3D applications such as robotic manipulation and gesture-based interfaces.
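A schematic sketch of the two-stage pipeline appears below: Boruta (via the BorutaPy package) prunes uninformative depth features against shadow features, and an LSTM autoencoder then reconstructs clean sequences from the selected ones. The synthetic data, sequence length, and hyperparameters are assumptions; the paper's exact feature definitions are not given.

```python
# Sketch: Boruta feature selection -> LSTM autoencoder denoising.
import numpy as np
import tensorflow as tf
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                 # candidate depth features
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=500)

# Step 1: Boruta keeps features that beat their shadow (shuffled) copies.
boruta = BorutaPy(RandomForestRegressor(n_estimators=100, n_jobs=-1),
                  n_estimators="auto", random_state=0)
boruta.fit(X, y)
X_sel = X[:, boruta.support_]                  # reduced dimensionality
print("kept features:", boruta.support_.sum())

# Step 2: LSTM autoencoder over sequences of the selected features.
seq_len, n_feat = 10, X_sel.shape[1]
seqs = X_sel[: (len(X_sel) // seq_len) * seq_len].reshape(-1, seq_len, n_feat)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, n_feat)),
    tf.keras.layers.LSTM(32),                                  # encoder -> latent
    tf.keras.layers.RepeatVector(seq_len),
    tf.keras.layers.LSTM(32, return_sequences=True),           # decoder
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_feat)),
])
model.compile(optimizer="adam", loss="mse")
model.fit(seqs, seqs, epochs=2, verbose=0)     # denoise by reconstruction
```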