AIM: To evaluate the efficacy of the total computer vision syndrome questionnaire (CVS-Q) score as a predictive tool for identifying individuals with symptomatic binocular vision anomalies and refractive errors. METHODS: A total of 141 healthy computer users underwent comprehensive clinical visual function assessments, including evaluations of refractive errors, accommodation (amplitude of accommodation, positive relative accommodation, negative relative accommodation, accommodative accuracy, and accommodative facility), and vergence (phoria, positive and negative fusional vergence, near point of convergence, and vergence facility). Total CVS-Q scores were recorded to explore potential associations between symptom scores and the aforementioned clinical visual function parameters. RESULTS: The cohort included 54 males (38.3%) with a mean age of 23.9±0.58y and 87 age-matched females (61.7%) with a mean age of 23.9±0.53y. The multiple regression model was statistically significant [R²=0.60, F=13.28, degrees of freedom (DF)=17, 122, P<0.001]. This indicates that 60% of the variance in total CVS-Q scores (reflecting reported symptoms) could be explained by four clinical measurements: amplitude of accommodation, positive relative accommodation, exophoria at distance and near, and positive fusional vergence at near. CONCLUSION: The total CVS-Q score is a valid and reliable tool for predicting the presence of various nonstrabismic binocular vision anomalies and refractive errors in symptomatic computer users.
Over the past decade, large-scale pre-trained autoregressive and diffusion models have rejuvenated the field of text-guided image generation. However, these models require enormous datasets and parameter counts, and their multi-step generation processes are often inefficient and difficult to control. To address these challenges, we propose CAFE-GAN, a CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination, which incorporates a pretrained CLIP model along with several key architectural innovations. First, we embed a coordinate attention mechanism into the generator to capture long-range dependencies and enhance feature representation. Second, we introduce a trainable linear projection layer after the CLIP text encoder, which aligns textual embeddings with the generator's semantic space. Third, we design a multi-scale discriminator that leverages pre-trained visual features and integrates a feature regularization strategy, thereby improving training stability and discrimination performance. Experiments on the CUB and COCO datasets demonstrate that CAFE-GAN outperforms existing text-to-image generation methods, achieving Fréchet Inception Distance (FID) scores of 9.84 and 5.62, respectively, and generating images with superior visual quality and semantic fidelity. These findings offer valuable insights for future research on efficient, controllable text-to-image synthesis.
AIM: To determine the prevalence of tropia, phoria, and abnormality of near point of convergence (NPC), along with associated ocular symptoms, in high school students. METHODS: This cross-sectional study was conducted in Erbil, Iraq. The target population consisted of high school students selected through a multi-stage cluster sampling method. Comprehensive visual examinations were performed for all students, including measurement of uncorrected and corrected visual acuity, objective and subjective refraction, and distance and near cover tests. NPC was evaluated using a single 6/12 visual target mounted on a centrally positioned Gulden fixation stick. Ocular symptoms were investigated through interviews. RESULTS: Of the 996 selected students, 921 participated in the study. Of them, 543 (58.96%) were female, and their ages ranged from 13 to 22y. The prevalence of tropia was 3.58% [95% confidence interval (CI): 2.38%-4.78%], observed in 3.44% of males and 3.68% of females. Exotropia (1.95%, 95%CI: 1.06%-2.85%) was more common than esotropia (1.52%, 95%CI: 0.73%-2.31%). Phoria was found in 15.42% (95%CI: 13.09%-17.75%) of students. Exophoria (13.79%, 95%CI: 11.56%-16.02%) was significantly more prevalent than esophoria (1.63%, 95%CI: 0.81%-2.45%). The prevalence of NPC abnormality in the total study population was 24.97% (95%CI: 22.18%-27.77%). It was 26.72% (95%CI: 22.26%-31.18%) in males and 23.76% (95%CI: 20.18%-27.34%) in females (P=0.307). The most common symptom in phoria was headache (86.62%, 95%CI: 81.02%-92.22%), followed by tired or sore eyes (61.97%, 95%CI: 53.99%-69.96%). The most common symptoms in tropia were blurry vision (93.94%, 95%CI: 79.77%-99.26%) and difficulty concentrating (87.88%, 95%CI: 76.74%-99.01%). CONCLUSION: Among Erbil's high school students, the prevalence of strabismus, particularly the exodeviation type, is relatively high, and a significant percentage of students have NPC abnormalities. Because these binocular vision problems are associated with visual symptoms, addressing and correcting them can improve students' quality of life and academic performance.
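The tropia interval reported in this abstract is consistent with a simple normal-approximation (Wald) interval for a proportion. The sketch below reproduces it under one assumption that is not stated in the abstract: that 33 of the 921 participants had tropia, a count inferred from the reported 3.58%. The study used cluster sampling, so its actual method may differ.

```python
import math

def wald_ci(successes, n, z=1.96):
    """Return (proportion, lower, upper) for a normal-approximation CI."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width

# Assumed count: 33 tropia cases among 921 students (inferred, not reported).
p, lo, hi = wald_ci(33, 921)
print(f"{100 * p:.2f}% (95% CI: {100 * lo:.2f}%-{100 * hi:.2f}%)")
```

The Wald interval is the simplest choice; for proportions close to 0% or 100%, such as the symptom rates in the tropia group, an exact or Wilson interval is usually preferred.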
Lung cancer remains a major global health challenge, with early diagnosis crucial for improved patient survival. Traditional diagnostic techniques, including manual histopathology and radiological assessments, are prone to errors and variability. Deep learning methods, particularly Vision Transformers (ViT), have shown promise for improving diagnostic accuracy by effectively extracting global features. However, ViT-based approaches face challenges related to computational complexity and limited generalizability. This research proposes the DualSet ViT-PSO-SVM framework, integrating a ViT with dual attention mechanisms, Particle Swarm Optimization (PSO), and Support Vector Machines (SVM), aiming for efficient and robust lung cancer classification across multiple medical image datasets. The study utilized three publicly available datasets: LIDC-IDRI, LUNA16, and TCIA, encompassing computed tomography (CT) scans and histopathological images. Data preprocessing included normalization, augmentation, and segmentation. Dual attention mechanisms enhanced ViT's feature extraction capabilities. PSO optimized feature selection, and SVM performed classification. Model performance was evaluated on individual and combined datasets, benchmarked against CNN-based and standard ViT approaches. The DualSet ViT-PSO-SVM significantly outperformed existing methods, achieving superior accuracy rates of 97.85% (LIDC-IDRI), 98.32% (LUNA16), and 96.75% (TCIA). Cross-dataset evaluations demonstrated strong generalization capabilities and stability across similar imaging modalities. The proposed framework effectively bridges advanced deep learning techniques with clinical applicability, offering a robust diagnostic tool for lung cancer detection, reducing complexity, and improving diagnostic reliability and interpretability.
The rapid advancements in computer vision (CV) technology have transformed the traditional approaches to material microstructure analysis. This review outlines the history of CV and explores the applications of deep-learning (DL)-driven CV in four key areas of materials science: microstructure-based performance prediction, microstructure information generation, microstructure defect detection, and crystal structure-based property prediction. CV has significantly reduced the cost of traditional experimental methods used in material performance prediction. Moreover, recent progress in generating microstructure images and detecting microstructural defects using CV has increased the efficiency and reliability of material performance assessments. DL-driven CV models can accelerate the design of new materials with optimized performance by integrating predictions based on both crystal and microstructural data, thereby enabling the discovery and innovation of next-generation materials. Finally, the review provides insights into the rapid interdisciplinary developments in the field of materials science and future prospects.
Human Activity Recognition (HAR) is a novel area of computer vision. Because it detects human behavior automatically, it has a great impact on healthcare, smart environments, and surveillance, and it plays a vital role in applications such as smart homes, human-computer interaction, sports analysis, and especially intelligent surveillance. However, due to the diversity of human actions, varied environmental influences, and a lack of data and resources, high recognition accuracy remains elusive. In this paper, we propose a robust and efficient HAR system that leverages deep learning paradigms, including pre-trained models, CNN architectures, and their average-weighted fusion. A weighted average ensemble technique is employed to fuse three deep learning models: EfficientNet, ResNet50, and a custom CNN. Experiments on the benchmark dataset show that the proposed weighted ensemble outperforms existing approaches in accuracy and other key performance measures. The ensemble obtained an accuracy of 98%, compared with 97%, 96%, and 95% for the custom CNN, EfficientNet, and ResNet50 models, respectively. These results indicate that a weighted average ensemble strategy is a promising way to build more effective HAR models for detecting and classifying human activities.
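As a rough illustration of weighted average fusion, the sketch below combines per-model class probabilities with normalized weights. The probability values are toy data, and using each model's reported accuracy as its weight is an assumption made for this example; the abstract does not say how the weights were chosen.

```python
import numpy as np

def weighted_average_ensemble(probs, weights):
    """Fuse per-model class probabilities with a normalized weighted average.

    probs:   shape (n_models, n_samples, n_classes)
    weights: shape (n_models,), non-negative
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()             # weights now sum to 1
    fused = np.tensordot(weights, probs, axes=1)  # (n_samples, n_classes)
    return fused.argmax(axis=1), fused

# Toy softmax outputs for 2 samples and 3 classes (hypothetical values).
probs = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],  # stand-in for EfficientNet
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],  # stand-in for ResNet50
    [[0.7, 0.2, 0.1], [0.2, 0.2, 0.6]],  # stand-in for the custom CNN
]
weights = [0.96, 0.95, 0.97]  # assumption: weight each model by its accuracy
labels, fused = weighted_average_ensemble(probs, weights)
print(labels)
```

In the second sample the first model votes for class 1, but the other two outweigh it and the fused prediction is class 2.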
Person recognition in photo collections is a critical yet challenging task in computer vision. Previous studies have used social relationships within photo collections to address this issue. However, these methods often fail on photos that contain only a single person, because they cannot rely on social connections for recognition. In this work, we discard social relationships and instead measure the relationships between photos to solve this problem. We designed a new model that includes a multi-parameter attention network for adaptively fusing visual features and a unified formula for measuring photo intimacy. This model effectively recognizes individuals in single-person photos within the collection. Due to outdated annotations and missing photos in the existing PIPA (Person in Photo Album) dataset, we manually re-annotated it and added approximately ten thousand photos of Asian individuals to address the underrepresentation issue. Our results on the re-annotated PIPA dataset are superior to those of previous studies in most cases, and experiments on the supplemented dataset further demonstrate the effectiveness of our method. We have made the PIPA dataset publicly available on Zenodo, with the DOI: 10.5281/zenodo.12508096 (accessed on 15 October 2025).
Recognising human-object interactions (HOI) is a challenging task for traditional machine learning models, including convolutional neural networks (CNNs). Existing models show limited transferability across complex datasets such as D3D-HOI and SYSU 3D HOI. The conventional architecture of CNNs restricts their ability to handle HOI scenarios with high complexity. HOI recognition requires improved feature extraction methods to overcome the current limitations in accuracy and scalability. This work proposes a novel quantum gate-enabled hybrid CNN (QEH-CNN) for effective HOI recognition. The model enhances CNN performance by integrating quantum computing components. The framework begins with bilateral image filtering, followed by multi-object tracking (MOT) and Felzenszwalb superpixel segmentation. A watershed algorithm refines object boundaries by cleaning merged superpixels. Feature extraction combines a histogram of oriented gradients (HOG), Global Image Statistics for Texture (GIST) descriptors, and a novel 23-joint keypoint extraction method using relative joint angles and joint proximity measures. A fuzzy optimization process refines the extracted features before feeding them into the QEH-CNN model. The proposed model achieves 95.06% accuracy on the D3D-HOI dataset and 97.29% on the SYSU 3D HOI dataset. The integration of quantum computing enhances feature optimization, leading to improved accuracy and overall model efficiency.
Vision Transformers (ViTs) have achieved remarkable success across various artificial intelligence-based computer vision applications. However, their demanding computational and memory requirements pose significant challenges for deployment on resource-constrained edge devices. Although post-training quantization (PTQ) provides a promising solution by reducing model precision with minimal calibration data, aggressive low-bit quantization typically leads to substantial performance degradation. To address this challenge, we present the truncated uniform-log2 quantizer and progressive bit-decline reconstruction method for vision Transformer quantization (TP-ViT). It is an innovative PTQ framework specifically designed for ViTs, featuring two key technical contributions: (1) a truncated uniform-log2 quantizer, a novel quantization approach that effectively handles outlier values in post-Softmax activations, significantly reducing quantization errors; (2) a bit-decline optimization strategy, which employs transition weights to gradually reduce bit precision while maintaining model performance under extreme quantization conditions. Comprehensive experiments on image classification, object detection, and instance segmentation tasks demonstrate TP-ViT's superior performance compared to state-of-the-art PTQ methods, particularly in challenging 3-bit quantization scenarios. Our framework achieves a notable improvement of 6.18 percentage points in top-1 accuracy for ViT-small under 3-bit quantization. These results validate TP-ViT's robustness and general applicability, paving the way for more efficient deployment of ViT models in computer vision applications on edge hardware.
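For orientation, the sketch below shows a plain log2 quantizer applied to post-Softmax values, which lie in (0, 1]. It is a generic baseline and not the truncated uniform-log2 quantizer proposed in the paper, whose details the abstract does not give. The function names and the toy attention row are illustrative.

```python
import numpy as np

def log2_quantize(x, n_bits=3):
    """Map values in (0, 1] to integer exponents q so that x is about 2**-q."""
    x = np.asarray(x, dtype=float)
    levels = 2 ** n_bits - 1                      # largest exponent code
    clipped = np.clip(x, 2.0 ** -levels, 1.0)     # avoid log2(0)
    q = np.round(-np.log2(clipped))
    return np.clip(q, 0, levels).astype(int)

def log2_dequantize(q):
    """Recover the approximate value 2**-q from an exponent code."""
    return 2.0 ** -np.asarray(q, dtype=float)

attn = np.array([0.70, 0.20, 0.06, 0.03, 0.01])  # toy post-Softmax row
q = log2_quantize(attn, n_bits=3)
approx = log2_dequantize(q)
print(q, approx)
```

With 3 bits the codes cover exponents 0 to 7, so small probabilities keep their order of magnitude, while large values are represented coarsely (0.70 is reconstructed as 0.5).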
The use of Unmanned Aerial Vehicles (UAVs) for defect detection on railway slopes is becoming increasingly widespread due to their ability to capture high-resolution images over large, inaccessible, and topographically complex areas. However, current UAV-based detection methods face several critical limitations, including constrained deployment frequency, limited availability of annotated defect data, and the lack of mature risk assessment frameworks. To address these challenges, this study introduces a novel approach that integrates diffusion models with Large Language Models (LLMs) to generate high-quality synthetic defect images tailored to railway slope scenarios. Furthermore, an improved transformer-based architecture is proposed, incorporating attention mechanisms and LLM-guided diffusion-generated imagery to enhance defect recognition performance under complex environmental conditions. Experimental evaluations conducted on a dataset of 300 field-collected images from high-risk railway slopes demonstrate that the proposed method significantly outperforms existing baselines in terms of precision, recall, and robustness, indicating strong applicability for real-world railway infrastructure monitoring and disaster prevention.
Human action recognition (HAR) is crucial for the development of efficient computer vision, where bioinspired neuromorphic perception visual systems have emerged as a vital solution to address transmission bottlenecks across sensor-processor interfaces. However, the absence of interactions among versatile biomimicking functionalities within a single device, which was developed for specific vision tasks, restricts the computational capacity, practicality, and scalability of in-sensor vision computing. Here, we propose a bioinspired vision sensor composed of a GaN/AlN-based ultrathin quantum-disks-in-nanowires (QD-NWs) array to mimic not only Parvo cells for high-contrast vision and Magno cells for dynamic vision in the human retina but also the synergistic activity between the two cells for in-sensor vision computing. By simply tuning the applied bias voltage on each QD-NW-array-based pixel, we achieve two biosimilar photoresponse characteristics with slow and fast reactions to light stimuli that enhance the in-sensor image quality and HAR efficiency, respectively. Strikingly, the interplay and synergistic interaction of the two photoresponse modes within a single device markedly increased the HAR recognition accuracy from 51.4% to 81.4% owing to the integrated artificial vision system. The demonstration of an intelligent vision sensor offers a promising device platform for the development of highly efficient HAR systems and future smart optoelectronics.
Low-light image enhancement aims to improve the visibility of severely degraded images captured under insufficient illumination, alleviating the adverse effects of illumination degradation on image quality. Traditional Retinex-based approaches, inspired by human visual perception of brightness and color, decompose an image into illumination and reflectance components to restore fine details. However, their limited capacity for handling noise and complex lighting conditions often leads to distortions and artifacts in the enhanced results, particularly under extreme low-light scenarios. Although deep learning methods built upon Retinex theory have recently advanced the field, most still suffer from insufficient interpretability and sub-optimal enhancement performance. This paper presents RetinexWT, a novel framework that tightly integrates classical Retinex theory with modern deep learning. Following Retinex principles, RetinexWT employs wavelet transforms to estimate illumination maps for brightness adjustment. A detail-recovery module that synergistically combines Vision Transformer (ViT) and wavelet transforms is then introduced to guide the restoration of lost details, thereby improving overall image quality. Within the framework, wavelet decomposition splits input features into high-frequency and low-frequency components, enabling scale-specific processing of global illumination/color cues and fine textures. Furthermore, a gating mechanism selectively fuses down-sampled and up-sampled features, while an attention-based fusion strategy enhances model interpretability. Extensive experiments on the LOL dataset demonstrate that RetinexWT surpasses existing Retinex-oriented deep learning methods, achieving an average Peak Signal-to-Noise Ratio (PSNR) improvement of 0.22 dB over the current state of the art (SOTA), thereby confirming its superiority in low-light image enhancement. Code is available at https://github.com/CHEN-hJ516/RetinexWT (accessed on 14 October 2025).
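To put the reported PSNR gain in context, the sketch below implements the standard PSNR definition and converts a dB difference into a mean-squared-error ratio. The `psnr` helper and the test images are illustrative and are not taken from the RetinexWT code base.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")                       # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
example_db = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))

# At a fixed peak value, a gain of d dB scales the MSE by 10 ** (-d / 10).
mse_ratio = 10 ** (-0.22 / 10)
print(example_db, mse_ratio)
```

Under this definition, a 0.22 dB improvement corresponds to an MSE about 5% lower than the baseline.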
Common strong noise interferences such as metal splashes, smoke, and arc light during welding can seriously pollute laser stripe images, causing the tracking model to drift and leading to tracking failure. Many mature methods already exist for identifying and extracting feature points of linear laser stripes, but they no longer apply when the laser stripe forms a curved shape on the surface of the workpiece. To eliminate interference sources, enhance the robustness of the weld tracking model, and effectively extract the feature points of curved laser stripes under strong noise conditions, this paper proposes a Conditional Generative Adversarial Network (CGAN)-based anti-interference recognition method for welding images. The generator adopts an improved U-Net++ structure, adds a Multi-scale Channel Attention module (MS-CAM), introduces Deep Supervision, and applies a Multi-output Fusion strategy (MOFS) to the output to enhance the image inpainting effect; the discriminator uses PatchGAN. The center of the laser stripe is obtained using the grayscale center of mass method and then combined with polynomial fitting to extract the feature points of the weld seam. The experimental results show that the inpainted images reach a PSNR of 26.24 dB, an SSIM of 0.98, and an LPIPS of 0.032. The centerlines of the inpainted image and of the noise-free laser stripe image are each fitted with a curve. The error of the centerline feature points is no more than 5%, confirming the superiority and feasibility of the method.
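The centerline step described above (grayscale center of mass followed by polynomial fitting) can be sketched as follows. This is a minimal illustration on a synthetic stripe, assuming a roughly horizontal stripe processed column by column; the function names, threshold, and polynomial degree are hypothetical, not the authors' settings.

```python
import numpy as np

def stripe_centerline(image, threshold=0.0):
    """Per-column intensity-weighted row centroid of a laser stripe image."""
    img = np.asarray(image, dtype=float)
    img = np.where(img > threshold, img, 0.0)     # suppress background
    rows = np.arange(img.shape[0])[:, None]
    mass = img.sum(axis=0)
    valid = mass > 0                              # skip columns with no stripe
    centers = (rows * img).sum(axis=0)[valid] / mass[valid]
    cols = np.arange(img.shape[1])[valid]
    return cols, centers

def fit_centerline(cols, centers, degree=2):
    """Least-squares polynomial fit of the centerline."""
    return np.polynomial.Polynomial.fit(cols, centers, degree)

# Synthetic curved stripe: Gaussian cross-section around a parabola.
cols_all = np.arange(40)
true_rows = 10 + 0.01 * (cols_all - 20) ** 2
rows_grid = np.arange(50)[:, None]
image = np.exp(-0.5 * ((rows_grid - true_rows[None, :]) / 1.5) ** 2)

cols, centers = stripe_centerline(image, threshold=1e-6)
poly = fit_centerline(cols, centers, degree=2)
max_error = float(np.max(np.abs(poly(cols) - true_rows[cols])))
print(max_error)
```

On real welding images the centroid is sensitive to residual splashes and arc light, which is why the abstract applies it to the inpainted image and not to the raw frame.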
Funding (CVS-Q study): Supported by the Ongoing Research Funding Program (ORFFT-2025-054-1), King Saud University, Riyadh, Saudi Arabia.
Funding (materials microstructure review): Financially supported by the National Science Fund for Distinguished Young Scholars, China (No. 52025041); the National Natural Science Foundation of China (Nos. 52450003, U2341267, and 52174294); the National Postdoctoral Program for Innovative Talents, China (No. BX20240437); and the Fundamental Research Funds for the Central Universities, China (Nos. FRF-IDRY-23-037 and FRF-TP-20-02C2).
Funding (human activity recognition study): Supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R765), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Funding: Supported by “the Fundamental Research Funds for the Central Universities” (Grant Nos.: 3282025045, 3282024008) and the “Science and Technology Project of the State Archives Administration of China” (Grant No.: 2025-Z-009).
Abstract: Person recognition in photo collections is a critical yet challenging task in computer vision. Previous studies have used social relationships within photo collections to address this issue. However, these methods often fail when performing single-person-in-photos recognition in photo collections, as they cannot rely on social connections for recognition. In this work, we discard social relationships and instead measure the relationships between photos to solve this problem. We designed a new model that includes a multi-parameter attention network for adaptively fusing visual features and a unified formula for measuring photo intimacy. This model effectively recognizes individuals in single photos within the collection. Due to outdated annotations and missing photos in the existing PIPA (Person in Photo Album) dataset, we manually re-annotated it and added approximately ten thousand photos of Asian individuals to address the underrepresentation issue. Our results on the re-annotated PIPA dataset are superior to previous studies in most cases, and experiments on the supplemented dataset further demonstrate the effectiveness of our method. We have made the PIPA dataset publicly available on Zenodo, with the DOI: 10.5281/zenodo.12508096 (accessed on 15 October 2025).
Funding: Supported and funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R410), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Recognising human-object interactions (HOI) is a challenging task for traditional machine learning models, including convolutional neural networks (CNNs). Existing models show limited transferability across complex datasets such as D3D-HOI and SYSU 3D HOI. The conventional architecture of CNNs restricts their ability to handle HOI scenarios with high complexity. HOI recognition requires improved feature extraction methods to overcome the current limitations in accuracy and scalability. This work proposes a novel quantum gate-enabled hybrid CNN (QEH-CNN) for effective HOI recognition. The model enhances CNN performance by integrating quantum computing components. The framework begins with bilateral image filtering, followed by multi-object tracking (MOT) and Felzenszwalb superpixel segmentation. A watershed algorithm refines object boundaries by cleaning merged superpixels. Feature extraction combines a histogram of oriented gradients (HOG), Global Image Statistics for Texture (GIST) descriptors, and a novel 23-joint keypoint extraction method using relative joint angles and joint proximity measures. A fuzzy optimization process refines the extracted features before feeding them into the QEH-CNN model. The proposed model achieves 95.06% accuracy on the 3D-D3D-HOI dataset and 97.29% on the SYSU 3D HOI dataset. The integration of quantum computing enhances feature optimization, leading to improved accuracy and overall model efficiency.
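The keypoint features mentioned in this abstract (relative joint angles and joint proximity measures) can be sketched as follows. The abstract gives no formulas, so this only shows the standard geometric definitions: the angle at a joint between two adjacent segments and the Euclidean distance between two points. The function names and the coordinates are hypothetical.

```python
import math


def joint_angle(a, b, c):
    """Return the angle in degrees at joint b between segments b->a and b->c."""
    v1 = [ai - bi for ai, bi in zip(a, b)]
    v2 = [ci - bi for ci, bi in zip(c, b)]
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("coincident keypoints: angle is undefined")
    cos_theta = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def joint_distance(p, q):
    """Return the Euclidean distance between two keypoints."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Hypothetical 3D keypoints for one arm and an object centre.
shoulder = (0.0, 0.0, 0.0)
elbow = (1.0, 0.0, 0.0)
wrist = (1.0, 1.0, 0.0)
object_centre = (4.0, 5.0, 0.0)

elbow_angle = joint_angle(shoulder, elbow, wrist)       # right angle at the elbow
hand_to_object = joint_distance(wrist, object_centre)   # proximity measure
```

Both functions work for 2D or 3D coordinates, since they only zip over the components of each point.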
Funding: Supported by the National Natural Science Foundation of China (Nos. 62301092 and 62301093).
Abstract: Vision Transformers (ViTs) have achieved remarkable success across various artificial intelligence-based computer vision applications. However, their demanding computational and memory requirements pose significant challenges for deployment on resource-constrained edge devices. Although post-training quantization (PTQ) provides a promising solution by reducing model precision with minimal calibration data, aggressive low-bit quantization typically leads to substantial performance degradation. To address this challenge, we present the truncated uniform-log2 quantizer and progressive bit-decline reconstruction method for vision Transformer quantization (TP-ViT). It is an innovative PTQ framework specifically designed for ViTs, featuring two key technical contributions: (1) truncated uniform-log2 quantizer, a novel quantization approach which effectively handles outlier values in post-Softmax activations, significantly reducing quantization errors; (2) bit-decline optimization strategy, which employs transition weights to gradually reduce bit precision while maintaining model performance under extreme quantization conditions. Comprehensive experiments on image classification, object detection, and instance segmentation tasks demonstrate TP-ViT’s superior performance compared to state-of-the-art PTQ methods, particularly in challenging 3-bit quantization scenarios. Our framework achieves a notable 6.18 percentage points improvement in top-1 accuracy for ViT-small under 3-bit quantization. These results validate TP-ViT’s robustness and general applicability, paving the way for more efficient deployment of ViT models in computer vision applications on edge hardware.
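To see why post-Softmax activations are awkward to quantize at 3 bits, the sketch below compares a plain uniform quantizer with a plain log2 quantizer on values in [0, 1]. This is not the truncated uniform-log2 quantizer of TP-ViT, whose exact definition is not given in the abstract; the two functions are textbook baselines, and their names and the sample values are assumptions for illustration.

```python
import math


def uniform_quantize(x, bits):
    """Quantize x in [0, 1] to the nearest of 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    return round(x * levels) / levels


def log2_quantize(x, bits):
    """Quantize x in (0, 1] to the nearest power of two 2**-k, k in [0, 2**bits - 1]."""
    max_code = (1 << bits) - 1
    if x <= 0.0:
        # Assumption: non-positive inputs map to the smallest representable level.
        return 2.0 ** -max_code
    k = round(-math.log2(x))
    k = min(max(k, 0), max_code)
    return 2.0 ** -k


bits = 3
small, large = 0.02, 0.9  # typical small and large attention probabilities

small_uniform = uniform_quantize(small, bits)  # collapses to 0.0
small_log2 = log2_quantize(small, bits)        # 2**-6 = 0.015625
large_uniform = uniform_quantize(large, bits)  # 6/7, close to 0.9
large_log2 = log2_quantize(large, bits)        # rounds up to 1.0
```

The uniform quantizer loses the small value entirely, while the log2 quantizer is coarse near 1. That trade-off is the usual motivation for combining the two schemes.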
Funding: Supported in part by the National Natural Science Foundation of China under Grant 52432012 and in part by the Shanghai Science and Technology Project under Grant 25ZR1402508.
Abstract: The use of Unmanned Aerial Vehicles (UAVs) for defect detection on railway slopes is becoming increasingly widespread due to their ability to capture high-resolution images over large, inaccessible, and topographically complex areas. However, current UAV-based detection methods face several critical limitations, including constrained deployment frequency, limited availability of annotated defect data, and the lack of mature risk assessment frameworks. To address these challenges, this study introduces a novel approach that integrates diffusion models with Large Language Models (LLMs) to generate high-quality synthetic defect images tailored to railway slope scenarios. Furthermore, an improved transformer-based architecture is proposed, incorporating attention mechanisms and LLM-guided diffusion-generated imagery to enhance defect recognition performance under complex environmental conditions. Experimental evaluations conducted on a dataset of 300 field-collected images from high-risk railway slopes demonstrate that the proposed method significantly outperforms existing baselines in terms of precision, recall, and robustness, indicating strong applicability for real-world railway infrastructure monitoring and disaster prevention.
Funding: Funded by the National Natural Science Foundation of China (Grant Nos. 62322410, 52272168, 624B2135, 61804047) and the Fundamental Research Funds for the Central Universities (No. WK2030000103).
Abstract: Human action recognition (HAR) is crucial for the development of efficient computer vision, where bioinspired neuromorphic perception visual systems have emerged as a vital solution to address transmission bottlenecks across sensor-processor interfaces. However, the absence of interactions among versatile biomimicking functionalities within a single device, which was developed for specific vision tasks, restricts the computational capacity, practicality, and scalability of in-sensor vision computing. Here, we propose a bioinspired vision sensor composed of a GaN/AlN-based ultrathin quantum-disks-in-nanowires (QD-NWs) array to mimic not only Parvo cells for high-contrast vision and Magno cells for dynamic vision in the human retina but also the synergistic activity between the two cell types for in-sensor vision computing. By simply tuning the applied bias voltage on each QD-NW-array-based pixel, we achieve two biosimilar photoresponse characteristics with slow and fast reactions to light stimuli that enhance the in-sensor image quality and HAR efficiency, respectively. Strikingly, the interplay and synergistic interaction of the two photoresponse modes within a single device markedly increased the HAR recognition accuracy from 51.4% to 81.4% owing to the integrated artificial vision system. The demonstration of an intelligent vision sensor offers a promising device platform for the development of highly efficient HAR systems and future smart optoelectronics.
Funding: Supported in part by the National Natural Science Foundation of China [Grant number 62471075] and the Major Science and Technology Project Grant of the Chongqing Municipal Education Commission [Grant number KJZD-M202301901].
Abstract: Low-light image enhancement aims to improve the visibility of severely degraded images captured under insufficient illumination, alleviating the adverse effects of illumination degradation on image quality. Traditional Retinex-based approaches, inspired by human visual perception of brightness and color, decompose an image into illumination and reflectance components to restore fine details. However, their limited capacity for handling noise and complex lighting conditions often leads to distortions and artifacts in the enhanced results, particularly under extreme low-light scenarios. Although deep learning methods built upon Retinex theory have recently advanced the field, most still suffer from insufficient interpretability and sub-optimal enhancement performance. This paper presents RetinexWT, a novel framework that tightly integrates classical Retinex theory with modern deep learning. Following Retinex principles, RetinexWT employs wavelet transforms to estimate illumination maps for brightness adjustment. A detail-recovery module that synergistically combines Vision Transformer (ViT) and wavelet transforms is then introduced to guide the restoration of lost details, thereby improving overall image quality. Within the framework, wavelet decomposition splits input features into high-frequency and low-frequency components, enabling scale-specific processing of global illumination/color cues and fine textures. Furthermore, a gating mechanism selectively fuses down-sampled and up-sampled features, while an attention-based fusion strategy enhances model interpretability. Extensive experiments on the LOL dataset demonstrate that RetinexWT surpasses existing Retinex-oriented deep learning methods, achieving an average Peak Signal-to-Noise Ratio (PSNR) improvement of 0.22 dB over the current state of the art (SOTA), thereby confirming its superiority in low-light image enhancement. Code is available at https://github.com/CHEN-hJ516/RetinexWT (accessed on 14 October 2025).
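The split into low-frequency and high-frequency components mentioned in this abstract can be illustrated with a one-level Haar transform on a single row of pixel values. The abstract does not say which wavelet RetinexWT uses, so the choice of Haar and the function names below are assumptions; the sketch only shows how a wavelet step separates smooth content from detail and remains invertible.

```python
import math


def haar_decompose(signal):
    """One-level orthonormal Haar transform of an even-length sequence.

    Returns (low, high): scaled pairwise sums (smooth content) and
    scaled pairwise differences (detail).
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def haar_reconstruct(low, high):
    """Invert haar_decompose, recovering the original sequence."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


# Hypothetical row of pixel intensities: a flat region, then a sharp edge.
row = [10.0, 12.0, 80.0, 20.0]
low, high = haar_decompose(row)
restored = haar_reconstruct(low, high)
```

The second detail coefficient is much larger in magnitude than the first because the edge between 80 and 20 is sharp. This is the kind of information a detail-recovery branch would operate on, while the low band carries the coarse brightness.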
Funding: Supported by “The 14th Five Year Plan” Hubei Provincial advantaged characteristic disciplines (groups) project of Wuhan University of Science and Technology (Grant No. 2023B0404), National Natural Science Foundation of China (Grant Nos. 52275503 and 72471181), Hubei Provincial Outstanding Youth Fund of China (Grant No. 2023AFA092), Hubei Provincial Natural Science Foundation of China (Grant No. 2023AFB915), and Hubei Provincial Key Research and Development Plan Project of China (Grant No. 2023BAB048).
Abstract: Common strong noise interferences like metal splashes, smoke, and arc light during welding can seriously pollute the laser stripe images, causing the tracking model to drift and leading to tracking failure. At present, there are already many mature methods for identifying and extracting feature points of linear laser stripes. When the laser stripe forms a curved shape on the surface of the workpiece, these linear methods are no longer applicable. To eliminate interference sources, enhance the robustness of the weld tracking model, and effectively extract the feature points of curved laser stripes under strong noise conditions, this paper proposes a Conditional Generative Adversarial Network (CGAN) based anti-interference recognition method for welding images. The generator adopts an improved U-Net++ structure, adds a Multi-scale Channel Attention module (MS-CAM), introduces Deep Supervision, and proposes a Multi-output Fusion strategy (MOFS) in the output result to enhance the image inpainting effect; the discriminator uses PatchGAN. The center of the laser stripe is obtained using the grayscale center of mass method and then combined with polynomial fitting to extract the feature points of the weld seam. The experimental results show that the PSNR of the inpainted image is 26.24 dB, the SSIM is 0.98, and the LPIPS is 0.032. The centerlines of the laser stripe in the inpainted image and in the noise-free image are each fitted with a curve. The error of centerline feature points is no more than 5%, confirming the superiority and feasibility of the method.
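The grayscale center of mass step named in this abstract can be sketched as an intensity-weighted mean of row indices, computed per image column. The abstract does not give thresholds or implementation details, so the function name `stripe_centers`, the `threshold` parameter, and the tiny test image are hypothetical.

```python
def stripe_centers(image, threshold=0):
    """Return the sub-pixel stripe center (row coordinate) for each column.

    image is a list of rows of gray values. For every column, pixels brighter
    than threshold contribute to an intensity-weighted mean of their row
    indices. Columns with no such pixel yield None.
    """
    n_rows = len(image)
    n_cols = len(image[0])
    centers = []
    for c in range(n_cols):
        total = 0.0
        weighted = 0.0
        for r in range(n_rows):
            v = image[r][c]
            if v > threshold:
                total += v
                weighted += r * v
        centers.append(weighted / total if total > 0 else None)
    return centers


# Hypothetical 5x3 gray image: a stripe about two pixels thick, empty last column.
image = [
    [0, 0, 0],
    [50, 0, 0],
    [100, 100, 0],
    [50, 100, 0],
    [0, 0, 0],
]

centers = stripe_centers(image)  # [2.0, 2.5, None]
```

The valid (column, center) pairs could then be passed to a least-squares polynomial fit, for example `numpy.polyfit`, to obtain a smooth centerline for a curved stripe. The polynomial degree is not stated in the abstract.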