Funding: The 2022 General Project of Gansu Provincial Philosophy and Social Sciences Planning (2022YB034).
Abstract: Linxia brick carving is an artistic carrier of multi-ethnic cultural intermingling, but its symbolic abstraction and diversity make digital conservation challenging. Traditional qualitative recording methods cannot support dynamic analysis or innovative applications. This study builds a framework that integrates vector representation with multimodal semantic mapping and uses it to quantify the historical semantics, artistic fusion, and technological features of Linxia brick carving heritage by constructing a 26-dimensional vector space. Structured descriptive templates resolve the semantic heterogeneity of the text-image data. The results show that the framework supports systematic analysis and innovation of Linxia brick carving cultural symbols with high classification accuracy and reveals structured semantic associations among patterns. By generalizing abstract symbols into 26-dimensional vectors, the study makes them computable; standardized templates regulate their digital expression, and the resulting multimodal datasets underpin multidimensional, artificial intelligence-driven protection mechanisms. The results provide methodological support for shifting cultural heritage from static records to living inheritance, and suggest transferability to analogous heritage contexts through dimensional remapping and template localization. These advances promote the deep integration of artificial intelligence with traditional art symbols and support research on protection strategies for traditional cultural heritage in the digital era.
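To illustrate how such a vector space can support similarity queries over carving patterns, the sketch below encodes two hypothetical pattern templates into 26-dimensional vectors and compares them with cosine similarity. The dimension labels, field values, and pattern names are placeholders, not the paper's published schema.

```python
import numpy as np

# Minimal sketch of a 26-dimensional pattern vector and a similarity query.
# Dimension names and values are hypothetical; the study's actual schema
# (historical semantics, artistic fusion, technological features) is not
# reproduced here.
DIMENSIONS = [f"dim_{i:02d}" for i in range(26)]  # placeholder dimension labels

def encode_pattern(template_fields: dict) -> np.ndarray:
    """Map a structured descriptive template to a 26-d vector (toy encoding)."""
    vec = np.zeros(len(DIMENSIONS))
    for i, name in enumerate(DIMENSIONS):
        vec[i] = float(template_fields.get(name, 0.0))
    return vec

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Two hypothetical patterns described by sparse template fields.
peony = encode_pattern({"dim_00": 1.0, "dim_05": 0.8, "dim_12": 0.3})
lotus = encode_pattern({"dim_00": 0.9, "dim_05": 0.7, "dim_20": 0.5})
print(f"similarity(peony, lotus) = {cosine_similarity(peony, lotus):.3f}")
```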
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62172458).
Abstract: The rapid advancement of artificial intelligence (AI) has ushered in a new era of medical multimodal large language models (MLLMs), which integrate diverse data modalities such as text, imaging, physiological signals, and genomics to enhance clinical decision-making. This systematic review explores the core methodologies and applied research frontiers of medical MLLMs, focusing on their architecture, training methods, evaluation techniques, and applications. We highlight the transformative potential of MLLMs in achieving cross-modal semantic alignment, medical knowledge integration, and robust clinical reasoning. Despite their promise, challenges such as data heterogeneity, hallucination, and computational efficiency persist. By reviewing state-of-the-art solutions and future directions, this paper provides a comprehensive technical guide for developing reliable and interpretable medical MLLMs, ultimately aiming to bridge the gap between AI and clinical practice.
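Cross-modal semantic alignment is commonly realized with a contrastive objective over paired image and text embeddings. The sketch below is a generic illustration of that idea, not the training recipe of any specific model surveyed; the batch size, embedding dimension, and temperature are arbitrary choices for the example.

```python
import numpy as np

# Generic sketch of cross-modal semantic alignment via a symmetric
# contrastive (InfoNCE-style) objective over paired image/text embeddings.
rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    img_emb, txt_emb = l2_normalize(img_emb), l2_normalize(txt_emb)
    logits = img_emb @ txt_emb.T / temperature          # (N, N) similarity matrix
    log_prob_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_prob_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    diag = np.arange(len(logits))
    # Matched pairs sit on the diagonal; pull them together, push others apart.
    return float(-(log_prob_i2t[diag, diag] + log_prob_t2i[diag, diag]).mean() / 2)

img = rng.normal(size=(8, 64))               # 8 paired image embeddings
txt = img + 0.1 * rng.normal(size=(8, 64))   # roughly aligned text embeddings
print(f"alignment loss = {contrastive_alignment_loss(img, txt):.3f}")
```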
Abstract: As urban areas and populations grow and the climate continues to change, the demand for better emergency response systems has become more important than ever. Human Behaviour Classification (HBC) systems have started to play a vital role by analysing data from different sources to detect signs of emergencies. These systems are being used in many critical areas such as healthcare, public safety, and disaster management to improve response times and to prepare ahead of time. But detecting human behaviour in such stressful conditions is not simple; it often comes with noisy data, missing information, and the need to react in real time. This review takes a deeper look at HBC research published between 2020 and 2025 and aims to answer five specific research questions. These questions cover the types of emergencies discussed in the literature, the datasets and sensors used, the effectiveness of machine learning (ML) and deep learning (DL) models, and the limitations that still exist in this field. We explored 120 papers that used different types of datasets: some were based on sensor data, others on social media, and a few used hybrid approaches. Commonly used models included CNNs, LSTMs, and reinforcement learning methods to identify behaviours. Though a lot of progress has been made, the review found ongoing issues in combining sensors properly, reacting fast enough, and using more diverse datasets. Overall, the findings suggest that the focus should be on building systems that use multiple sensors together, gather real-time data on a large scale, and produce results that are easier to interpret. Proper attention to privacy and ethical concerns is also needed.
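The review does not fix a reference architecture. As one sketch of what "using multiple sensors together" can mean in practice, the example below performs simple early fusion of per-window summary statistics from two hypothetical wearable sensors; sensor names, window size, and features are illustrative, not drawn from any reviewed system.

```python
import numpy as np

# Minimal sketch of early sensor fusion for behaviour classification:
# concatenate per-window summary statistics from two wearable sensors.
def window_features(signal: np.ndarray) -> np.ndarray:
    """Mean, std, min, max per channel for one window of shape (samples, channels)."""
    return np.concatenate([signal.mean(0), signal.std(0), signal.min(0), signal.max(0)])

rng = np.random.default_rng(1)
accelerometer = rng.normal(size=(128, 3))   # one 128-sample window, 3 axes
gyroscope = rng.normal(size=(128, 3))

# Early fusion: a single feature vector fed to any downstream classifier
# (a CNN, an LSTM, or a classical model such as a random forest).
fused = np.concatenate([window_features(accelerometer), window_features(gyroscope)])
print(fused.shape)  # (24,)
```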
Funding: Supported by the National Natural Science Foundation of China [Grant No. 42071413] and the GHfund C [Grant No. 202302039381].
Abstract: With the increasing frequency of floods, in-depth flood event analyses are essential for effective disaster relief and prevention. Owing to their greater availability, satellite-based flood event datasets have replaced limited disaster maps as the primary data source for flood event analyses. Nevertheless, despite the vast amount of available remote sensing imagery, existing flood event datasets still pose significant challenges for flood event analyses because of the uneven geographical distribution of data, the scarcity of time series data, and the limited availability of flood-related semantic information. Deep learning models have been widely adopted for flood event analyses, but some existing flood datasets do not align well with model training, and distinguishing flooded areas has proven difficult with limited data modalities and semantic information. Moreover, efficient retrieval and pre-screening of flood-related imagery from vast satellite archives pose notable obstacles, particularly in large-scale analyses. To address these issues, we propose a Multimodal Flood Event Dataset (MFED) for deep-learning-based flood event analyses and data retrieval. It consists of 18 years of multi-source remote sensing imagery and heterogeneous textual information covering flood-prone areas worldwide. Incorporating optical and radar imagery exploits the correlation and complementarity between distinct image modalities to capture pixel-level features in flood imagery. Notably, text modality data, including auxiliary hydrological information extracted from the Global Flood Database and text refined from online news records, offer a semantic supplement to the images for flood event retrieval and analysis. To verify the applicability of the MFED to deep learning models, we carried out experiments with different models using a single modality and different combinations of modalities, which confirmed the effectiveness of the dataset. Furthermore, we also verified the efficiency of the MFED in comparative experiments with existing multimodal datasets and diverse neural network structures.
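A dataset that pairs optical imagery, radar imagery, and heterogeneous text suggests a record layout along the lines of the sketch below. The field names, file names, and the keyword-based pre-screening helper are hypothetical illustrations, not the published MFED schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for a multimodal flood-event sample pairing
# optical and radar imagery with textual metadata.
@dataclass
class FloodEventSample:
    event_id: str
    optical_paths: List[str]          # time series of optical image files
    sar_paths: List[str]              # time series of radar (SAR) image files
    hydrology_text: str               # auxiliary hydrological description
    news_text: str                    # refined text from online news records
    flooded: bool                     # event-level label

def retrieval_keywords(sample: FloodEventSample) -> set:
    """Toy text-side pre-screening: keyword set used to filter candidate imagery."""
    tokens = (sample.hydrology_text + " " + sample.news_text).lower().split()
    return {t for t in tokens if len(t) > 3}

sample = FloodEventSample(
    event_id="evt_0001",
    optical_paths=["opt_t0.tif", "opt_t1.tif"],
    sar_paths=["sar_t0.tif"],
    hydrology_text="river discharge exceeded the 10-year return period",
    news_text="villages along the lower basin reported inundation",
    flooded=True,
)
print(sorted(retrieval_keywords(sample))[:5])
```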
Abstract: Emotion Recognition in Conversations (ERC) is fundamental to creating emotionally intelligent machines. Graph-Based Network (GBN) models have gained popularity in detecting conversational contexts for ERC tasks. However, their limited ability to collect and acquire contextual information hinders their effectiveness. To address this, we propose a Text Augmentation-based computational model for recognizing emotions using transformers (TA-MERT). The proposed model uses the Multimodal EmotionLines Dataset (MELD) and ensures a balanced representation for recognizing human emotions. The model uses text augmentation techniques to produce more training data, improving its accuracy. The deep neural network (DNN) model is trained with transformer encoders, in particular Bidirectional Encoder (BE) representations that capture both forward and backward contextual information. This integration improves the accuracy and robustness of the proposed model. Furthermore, we present a method for balancing the training dataset by creating enhanced samples from the original dataset. By balancing the dataset across all emotion categories, we lessen the adverse effects of data imbalance on the accuracy of the proposed model. Experimental results on the MELD dataset show that TA-MERT outperforms earlier methods, achieving a weighted F1 score of 62.60% and an accuracy of 64.36%. Overall, the proposed TA-MERT model addresses the GBN models' weaknesses in obtaining contextual data for ERC. The TA-MERT model recognizes human emotions more accurately by employing text augmentation and transformer-based encoding. The balanced dataset and the additional training samples also enhance its resilience. These findings highlight the significance of transformer-based approaches for emotion recognition in conversations.
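The abstract does not spell out the augmentation operator, so the sketch below uses random word swapping purely as a stand-in to show how minority emotion classes can be oversampled with augmented utterances until every class matches the largest one; the example utterances and labels are invented.

```python
import random
from collections import Counter

# Minimal sketch of balancing an emotion-labelled text corpus by augmenting
# minority classes. Random word swapping stands in for whatever augmentation
# TA-MERT actually uses, which is not specified in the abstract.
def augment(utterance: str) -> str:
    """Swap two random word positions to create a perturbed training sample."""
    words = utterance.split()
    if len(words) > 1:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def balance(samples):
    """Oversample every emotion class up to the size of the largest class."""
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    balanced = list(samples)
    for label, count in counts.items():
        pool = [text for text, lab in samples if lab == label]
        balanced += [(augment(random.choice(pool)), label) for _ in range(target - count)]
    return balanced

data = [("I am so happy today", "joy"), ("this is awful", "anger"), ("great news", "joy")]
print(Counter(label for _, label in balance(data)))
```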
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62377007 and 62407009), the Chongqing University Graduate Education Teaching Reform Research Key Project, China (Grant No. 232073), the Scientific and Technological Research Program of Chongqing Municipal Education Commission, China (Grant Nos. KJZD-M202400606 and KJZD-M202300603), and the Chongqing Natural Science Foundation Joint Key Project for Innovation and Development, China (Grant No. 2024NSCQ-LZX0057).
Abstract: Teacher emotion recognition (TER) has a significant impact on student engagement, classroom atmosphere, and teaching quality, and is a research hotspot in the smart education area. However, existing studies lack high-quality multimodal datasets and neglect the common and discriminative features of multimodal data in emotion expression. To address these challenges, this research constructs a multimodal TER dataset suited to real classroom teaching scenarios. The TER dataset contains a total of 102 lessons and 2,170 video segments from multiple educational stages and subjects, innovatively labelled with emotional tags that characterize teacher-student interactions, such as satisfaction and questions. To explore the characteristics of multimodal data in emotion expression, this research proposes an emotion dual-space network (EDSN) that establishes an emotion commonality space construction (ECSC) module and an emotion discrimination space construction (EDSC) module. Specifically, the EDSN uses central moment differences as a similarity measure to assess the correlation between modalities within the emotion commonality space. On this basis, a gradient reversal layer and orthogonal projection are further used to construct the EDSC, extracting unique emotional information and removing redundant information from each modality. Experimental results demonstrate that the EDSN achieves an accuracy of 0.770 and a weighted F1 score of 0.769 on the TER dataset, outperforming other comparative models.
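As a rough picture of what a central-moment-based similarity between modalities looks like, the sketch below computes a generic central-moment-difference distance between two batches of modality embeddings up to order K. The exact EDSN formulation may differ, and the batch shapes and modality names are assumptions made only for the example.

```python
import numpy as np

# Generic sketch of a central-moment-difference distance between two modality
# embedding batches (e.g., audio vs. video features projected into a shared
# commonality space); it follows the usual central moment discrepancy recipe.
def central_moment_difference(x: np.ndarray, y: np.ndarray, k_max: int = 3) -> float:
    """x, y: (batch, dim) embeddings from two modalities."""
    diff = np.linalg.norm(x.mean(0) - y.mean(0))          # first moments (means)
    for k in range(2, k_max + 1):
        cx = ((x - x.mean(0)) ** k).mean(0)               # k-th central moment of x
        cy = ((y - y.mean(0)) ** k).mean(0)
        diff += np.linalg.norm(cx - cy)
    return float(diff)

rng = np.random.default_rng(2)
audio = rng.normal(loc=0.0, size=(32, 16))
video = rng.normal(loc=0.3, size=(32, 16))   # slightly shifted distribution
print(f"CMD-style distance = {central_moment_difference(audio, video):.3f}")
```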
Funding: Supported by the Start-up Fund for RAPs under the Strategic Hiring Scheme (P0048623) from HKSAR, the Global STEM Professorship Scheme (P0046113), and the Henry G. Leong Endowed Professorship in Elderly Vision Health.
Abstract: Purpose: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology. Methods: In this cross-sectional study, ophthalmic image posts and associated captions published between Jan 1, 2016, and Dec 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate six VLMs: GPT-4o, Gemini 2.0 Flash, Qwen2.5-VL-72B-Instruct, Janus-Pro-7B, InternVL3-8B, and HealthGPT-L14. The primary outcome was overall accuracy; secondary outcomes included subset-, subspeciality-, and modality-specific accuracy. Performance on open-ended questions was also quantified using language-based metrics, including AlignScore, BARTScore, BERTScore, BLEU, CIDEr, METEOR, and ROUGE_L. Error types in open-ended responses were manually analyzed through stratified sampling. Results: OphthalWeChat included 3,469 images and 30,120 QA pairs covering 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.555), significantly outperforming GPT-4o (0.527), Qwen2.5-VL-72B-Instruct (0.520), HealthGPT-L14 (0.502), InternVL3-8B (0.453), and Janus-Pro-7B (0.333) (all P<0.001). It also led in both the Chinese (0.551) and English (0.559) subsets. By subset, Gemini 2.0 Flash excelled in Binary_CN (0.687) and Single-choice_CN (0.666); HealthGPT-L14 performed best in Single-choice_EN (0.739); while GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (0.254), and Open-ended_EN (0.271). Language-based metrics showed rankings inconsistent with accuracy in the open-ended subsets. Performance varied across subspecialties and modalities, with Gemini 2.0 Flash leading in 6 of 9 subspecialties and 11 of the top 15 imaging modalities. Error type analysis revealed lesion/diagnosis errors as the most frequent (35.6%-50.6%), followed by anatomical location errors (28.3%-37.5%). Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset enables quantitative evaluation of VLMs, supporting the development of accurate and specialized AI systems for eye care.
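The primary outcome, per-subset and overall accuracy, can be illustrated with the toy scoring routine below. Record fields and example answers are hypothetical, and the exact-match rule used here is a simplification; in the study, open-ended responses are additionally scored with language-based metrics such as BLEU and ROUGE_L.

```python
from collections import defaultdict

# Toy illustration of the benchmark's primary metric: per-subset and overall
# accuracy over model responses (hypothetical records, not OphthalWeChat data).
records = [
    {"subset": "Binary_EN", "gold": "yes", "prediction": "yes"},
    {"subset": "Binary_EN", "gold": "no", "prediction": "yes"},
    {"subset": "Single-choice_CN", "gold": "B", "prediction": "B"},
    {"subset": "Open-ended_CN", "gold": "视网膜脱离", "prediction": "白内障"},
]

def subset_accuracy(items):
    correct, total = defaultdict(int), defaultdict(int)
    for r in items:
        total[r["subset"]] += 1
        correct[r["subset"]] += int(r["prediction"].strip().lower() == r["gold"].strip().lower())
    per_subset = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_subset, overall

per_subset, overall = subset_accuracy(records)
print(per_subset, f"overall={overall:.3f}")
```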