Journal Articles
7 articles found
1. Vector Representation and Multimodal Dataset Construction for the Digital Preservation of Linxia Brick Carving Cultural Heritage
Authors: ZHANG Zhiteng, CHEN Wangxing, YU Hongzhi. Journal of Resources and Ecology, 2025, No. 6, pp. 1646-1654 (9 pages)
Linxia brick carving is an artistic carrier of multi-ethnic cultural intermingling, but its symbolic abstraction and diversity make digital conservation challenging. Traditional qualitative recording methods cannot support dynamic analysis or innovative applications. This study builds a framework that integrates vector representation with multimodal semantic mapping, and uses it to quantify the historical semantics, artistic fusion, and technological features of Linxia brick carving cultural heritage by constructing a 26-dimensional vector space. This approach resolves the semantic heterogeneity of textual-image data with the help of structured descriptive templates. The results show that the framework supports systematic analysis and innovation of Linxia brick carving cultural symbols with high classification accuracy and reveals structured semantic associations among patterns. The study transforms abstract symbols into computable values through generalized 26-dimensional vectors, regulates their digital expression with standardized templates, and relies on multimodal datasets to establish multidimensional, artificial-intelligence-driven protection mechanisms. The results provide methodological support for the shift in cultural heritage from static records to living inheritance, and demonstrate potential transferability to analogous heritage contexts through dimensional remapping and template localization strategies. These advances promote the deep integration of artificial intelligence and traditional art symbols, supporting research on protection strategies for traditional cultural heritage in the digital era.
Keywords: Linxia brick carving; vector representation; multimodal dataset; digital ecosystem
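The abstract's move from qualitative records to a 26-dimensional vector space can be illustrated with a toy sketch. The dimension names and scores below are invented for illustration; the paper's actual 26-dimensional scheme is not reproduced here.

```python
import math

# Hypothetical 6-dimensional slice of a pattern-feature space; the axis
# names and values are illustrative only, not taken from the Linxia dataset.
DIMENSIONS = ["floral_motif", "geometric_motif", "calligraphic_element",
              "relief_depth", "historical_era", "regional_style"]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two hypothetical carved panels encoded on the same axes.
peony_panel   = [0.9, 0.2, 0.1, 0.7, 0.5, 0.8]
lattice_panel = [0.1, 0.9, 0.0, 0.4, 0.5, 0.8]
similarity = cosine_similarity(peony_panel, lattice_panel)
```

Once patterns live in a shared vector space, semantic association becomes a nearest-neighbour query rather than a manual side-by-side comparison.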
2. Medical multimodal large language models: A systematic review [Cited by: 1]
Authors: Yuan Hu, Chenhan Xu, Bo Lin, Weibin Yang, Yuan Yan Tang. Intelligent Oncology, 2025, No. 4, pp. 308-325 (18 pages)
The rapid advancement of artificial intelligence (AI) has ushered in a new era of medical multimodal large language models (MLLMs), which integrate diverse data modalities such as text, imaging, physiological signals, and genomics to enhance clinical decision-making. This systematic review explores the core methodologies and applied research frontiers of medical MLLMs, focusing on their architecture, training methods, evaluation techniques, and applications. We highlight the transformative potential of MLLMs in achieving cross-modal semantic alignment, medical knowledge integration, and robust clinical reasoning. Despite their promise, challenges such as data heterogeneity, hallucination, and computational efficiency persist. By reviewing state-of-the-art solutions and future directions, this paper provides a comprehensive technical guide for developing reliable and interpretable medical MLLMs, ultimately aiming to bridge the gap between AI and clinical practice.
Keywords: multimodal large language model; hallucination; medical multimodal dataset; clinical evaluation
3. Human Behaviour Classification in Emergency Situations Using Machine Learning with Multimodal Data: A Systematic Review (2020-2025)
Authors: Mirza Murad Baig, Muhammad Rehan Faheem, Lal Khan, Hannan Adeel, Syed Asim Ali Shah. Computer Modeling in Engineering & Sciences, 2025, No. 12, pp. 2895-2935 (41 pages)
With growing urban areas and populations, the climate continues to change, and the demand for better emergency response systems has become more important than ever. Human Behaviour Classification (HBC) systems have started to play a vital role by analysing data from different sources to detect signs of emergencies. These systems are used in many critical areas like healthcare, public safety, and disaster management to improve response times and to prepare ahead of time. But detecting human behaviour in such stressful conditions is not simple; it often comes with noisy data, missing information, and the need to react in real time. This review takes a deeper look at HBC research published between 2020 and 2025 and aims to answer five specific research questions. These questions cover the types of emergencies discussed in the literature, the datasets and sensors used, the effectiveness of machine learning (ML) and deep learning (DL) models, and the limitations that still exist in this field. We explored 120 papers that used different types of datasets: some were based on sensor data, others on social media, and a few used hybrid approaches. Commonly used models included CNNs, LSTMs, and reinforcement learning methods to identify behaviours. Though a lot of progress has been made, the review found ongoing issues in combining sensors properly, reacting fast enough, and using more diverse datasets. Overall, the findings suggest the focus should be on building systems that use multiple sensors together, gather real-time data on a large scale, and produce results that are easier to interpret. Privacy and ethical concerns also need proper attention.
Keywords: Human Behaviour Classification (HBC); public safety; multimodal datasets; privacy concerns; missing information; multi-sensor integration; healthcare
4. A global multimodal flood event dataset with heterogeneous text and multi-source remote sensing images [Cited by: 1]
Authors: Zhixin Zhang, Yan Ma, Peng Liu. Big Earth Data, 2025, No. 3, pp. 362-388 (27 pages)
With the increasing frequency of floods, in-depth flood event analyses are essential for effective disaster relief and prevention. Satellite-based flood event datasets have replaced limited disaster maps as the primary data source for flood event analyses, thanks to their enhanced availability. Nevertheless, despite the vast amount of available remote sensing images, existing flood event datasets still pose significant challenges for flood event analyses due to the uneven geographical distribution of data, the scarcity of time series data, and the limited availability of flood-related semantic information. Deep learning models have surged in acceptance for flood event analyses, but some existing flood datasets do not align well with model training, and distinguishing flooded areas has proven difficult with limited data modalities and semantic information. Moreover, efficient retrieval and pre-screening of flood-related imagery from vast satellite data impose notable obstacles, particularly in large-scale analyses. To address these issues, we propose a Multimodal Flood Event Dataset (MFED) for deep-learning-based flood event analyses and data retrieval. It consists of 18 years of multi-source remote sensing imagery and heterogeneous textual information covering flood-prone areas worldwide. Incorporating optical and radar imagery exploits the correlation and complementarity between distinct image modalities to capture the pixel features of flood imagery. Notably, text modality data, including auxiliary hydrological information extracted from the Global Flood Database and text refined from online news records, also offer a semantic supplement to the images for flood event retrieval and analysis. To verify the applicability of the MFED in deep learning models, we carried out experiments with different models using a single modality and different combinations of modalities, which fully verified the effectiveness of the dataset. Furthermore, we also verified the efficiency of the MFED in comparative experiments with existing multimodal datasets and diverse neural network structures.
Keywords: flood event; multimodal dataset; deep learning; multi-source remote sensing data; internet data
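The retrieval and pre-screening role the abstract assigns to the text modality can be sketched as a simple filter-then-rank step. The record layout and field names below are assumptions for illustration, not the actual MFED schema.

```python
from dataclasses import dataclass

# Assumed record layout for illustration; not the real MFED schema.
@dataclass
class FloodRecord:
    event_id: str
    optical_tiles: int   # count of optical image tiles for the event
    radar_tiles: int     # count of radar (SAR) image tiles for the event
    text: str            # hydrological notes plus refined news snippets

def prescreen(records, query_terms):
    """Keep records whose text mentions every query term, then rank by
    total image coverage (optical + radar tiles)."""
    hits = [r for r in records
            if all(t.lower() in r.text.lower() for t in query_terms)]
    return sorted(hits, key=lambda r: r.optical_tiles + r.radar_tiles,
                  reverse=True)

# Two made-up events to exercise the filter.
events = [
    FloodRecord("ev1", 12, 8, "Monsoon flood along the river, levee breach"),
    FloodRecord("ev2", 3, 20, "Flash flood after dam release, urban inundation"),
]
top = prescreen(events, ["flood", "levee"])
```

This mirrors the paper's point: text metadata lets an analyst narrow a vast image archive before any deep learning model sees a single pixel.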
5. Text Augmentation-Based Model for Emotion Recognition Using Transformers
Authors: Fida Mohammad, Mukhtaj Khan, Safdar Nawaz Khan Marwat, Naveed Jan, Neelam Gohar, Muhammad Bilal, Amal Al-Rasheed. Computers, Materials & Continua (SCIE, EI), 2023, No. 9, pp. 3523-3547 (25 pages)
Emotion Recognition in Conversations (ERC) is fundamental to creating emotionally intelligent machines. Graph-Based Network (GBN) models have gained popularity in detecting conversational context for ERC tasks. However, their limited ability to collect and acquire contextual information hinders their effectiveness. To address this, we propose a Text Augmentation-based computational model for recognizing emotions using transformers (TA-MERT). The proposed model uses the Multimodal Emotion Lines Dataset (MELD), which ensures a balanced representation for recognizing human emotions. The model uses text augmentation techniques to produce more training data, improving its accuracy. Transformer encoders train the deep neural network (DNN) model, especially Bidirectional Encoder (BE) representations that capture both forward and backward contextual information. This integration improves the accuracy and robustness of the proposed model. Furthermore, we present a method for balancing the training dataset by creating enhanced samples from the original dataset. By balancing the dataset across all emotion categories, we lessen the adverse effects of data imbalance on model accuracy. Experimental results on the MELD dataset show that TA-MERT outperforms earlier methods, achieving a weighted F1 score of 62.60% and an accuracy of 64.36%. Overall, the proposed TA-MERT model addresses the GBN models' weaknesses in obtaining contextual data for ERC. It recognizes human emotions more accurately by employing text augmentation and transformer-based encoding. The balanced dataset and the additional training samples also enhance its resilience. These findings highlight the significance of transformer-based approaches for emotion recognition in conversations.
Keywords: emotion recognition in conversation; graph-based network; text augmentation-based model; Multimodal Emotion Lines Dataset; bidirectional encoder representation for transformer
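As a rough illustration of augmentation-driven class balancing (the paper's own augmentation pipeline is richer than this), a minimal word-dropout oversampler might look like the following. Everything here is a toy sketch, not the TA-MERT implementation.

```python
import random
from collections import Counter

def word_dropout(text, p=0.15, rng=None):
    """Create an augmented copy of an utterance by randomly dropping words."""
    rng = rng or random.Random(0)
    kept = [w for w in text.split() if rng.random() > p]
    return " ".join(kept) if kept else text

def balance(samples):
    """samples: list of (utterance, emotion_label) pairs. Oversample each
    minority class with augmented copies until every class matches the
    largest one."""
    rng = random.Random(42)
    by_label = {}
    for utt, lab in samples:
        by_label.setdefault(lab, []).append(utt)
    target = max(len(utts) for utts in by_label.values())
    out = []
    for lab, utts in by_label.items():
        out.extend((u, lab) for u in utts)
        for _ in range(target - len(utts)):
            out.append((word_dropout(rng.choice(utts), rng=rng), lab))
    return out
```

Balancing before training keeps the loss from being dominated by the majority emotion, which is the imbalance effect the abstract describes.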
6. Emotion Dual-Space Network Based on Common and Discriminative Features for Multimodal Teacher Emotion Recognition
Authors: Ting Cai, Shengsong Wang, Jing Wang, Yu Xiong, Long Liu. Frontiers of Digital Education, 2025, No. 3, pp. 57-71 (15 pages)
Teacher emotion recognition (TER) has a significant impact on student engagement, classroom atmosphere, and teaching quality, and is a research hotspot in the smart education area. However, existing studies lack high-quality multimodal datasets and neglect the common and discriminative features of multimodal data in emotion expression. To address these challenges, this research constructs a multimodal TER dataset suitable for real classroom teaching scenarios. The TER dataset contains 102 lessons and 2,170 video segments from multiple educational stages and subjects, innovatively labelled with emotional tags that characterize teacher-student interactions, such as satisfaction and questioning. To explore the characteristics of multimodal data in emotion expression, this research proposes an emotion dual-space network (EDSN) that establishes an emotion commonality space construction (ECSC) module and an emotion discrimination space construction (EDSC) module. Specifically, the EDSN uses central moment differences to measure similarity and assess the correlation between modalities within the emotion commonality space. On this basis, a gradient reversal layer and orthogonal projection construct the EDSC to extract unique emotional information and remove redundant information from each modality. Experimental results demonstrate that the EDSN achieves an accuracy of 0.770 and a weighted F1 score of 0.769 on the TER dataset, outperforming other comparative models.
Keywords: teacher emotion recognition; emotion dual-space network; multimodal teacher emotion dataset; emotion commonality space construction module; emotion discrimination space construction module
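The "central moment differences" the abstract mentions for the commonality space resemble the Central Moment Discrepancy family of distribution distances. The scalar sketch below is a simplified illustration of that idea, not the paper's exact formulation.

```python
def central_moment_discrepancy(x, y, k=3):
    """Compare two 1-D samples by the difference of their means plus the
    differences of their central moments up to order k (simplified CMD)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    d = abs(mx - my)
    for order in range(2, k + 1):
        cx = sum((v - mx) ** order for v in x) / len(x)  # k-th central moment of x
        cy = sum((v - my) ** order for v in y) / len(y)  # k-th central moment of y
        d += abs(cx - cy) ** (1.0 / order)
    return d
```

A small discrepancy indicates that two modalities' feature distributions share structure, which is what the commonality space is built to encourage.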
7. Benchmarking large multimodal models for ophthalmic visual question answering with OphthalWeChat
Authors: Pusheng Xu, Xia Gong, Xiaolan Chen, Weiyi Zhang, Jiancheng Yang, Bingjie Yan, Meng Yuan, Yalin Zheng, Mingguang He, Danli Shi. Advances in Ophthalmology Practice and Research, 2026, No. 1, pp. 33-41 (9 pages)
Purpose: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology. Methods: In this cross-sectional study, ophthalmic image posts and associated captions published between Jan 1, 2016, and Dec 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate six VLMs: GPT-4o, Gemini 2.0 Flash, Qwen2.5-VL-72B-Instruct, Janus-Pro-7B, InternVL3-8B, and HealthGPT-L14. The primary outcome was overall accuracy; secondary outcomes included subset-, subspeciality-, and modality-specific accuracy. Performance on open-ended questions was also quantified using language-based metrics, including AlignScore, BARTScore, BERTScore, BLEU, CIDEr, METEOR, and ROUGE_L. Error types in open-ended responses were manually analyzed through stratified sampling. Results: OphthalWeChat included 3,469 images and 30,120 QA pairs covering 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.555), significantly outperforming GPT-4o (0.527), Qwen2.5-VL-72B-Instruct (0.520), HealthGPT-L14 (0.502), InternVL3-8B (0.453), and Janus-Pro-7B (0.333) (all P < 0.001). It also led in both the Chinese (0.551) and English (0.559) subsets. By subset, Gemini 2.0 Flash excelled in Binary_CN (0.687) and Single-choice_CN (0.666); HealthGPT-L14 performed best in Single-choice_EN (0.739); while GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (0.254), and Open-ended_EN (0.271). Language-based metrics showed inconsistent rankings relative to accuracy in the open-ended subsets. Performance varied across subspecialties and modalities, with Gemini 2.0 Flash leading in 6 of 9 subspecialties and 11 of the top 15 imaging modalities. Error type analysis revealed lesion/diagnosis errors as the most frequent (35.6%-50.6%), followed by anatomical location errors (28.3%-37.5%). Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset enables quantitative evaluation of VLMs, supporting the development of accurate and specialized AI systems for eye care.
Keywords: visual question answering; ophthalmology; multimodal benchmark; multimodal dataset; vision-language models
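The benchmark's primary and secondary outcomes reduce to accuracy tallies over labelled subsets. A minimal scorer might look like the sketch below; subset names follow the abstract, the demo numbers are made up, and grading each model answer as correct or not is assumed to happen elsewhere.

```python
def subset_accuracy(results):
    """results: iterable of (subset_name, is_correct) pairs. Returns the
    overall accuracy and a per-subset breakdown."""
    totals, correct = {}, {}
    for subset, ok in results:
        totals[subset] = totals.get(subset, 0) + 1
        correct[subset] = correct.get(subset, 0) + int(ok)
    overall = sum(correct.values()) / sum(totals.values())
    return overall, {s: correct[s] / totals[s] for s in totals}

# Tiny illustrative run, not real benchmark outputs.
demo = [("Binary_EN", True), ("Binary_EN", False),
        ("Open-ended_CN", False), ("Open-ended_CN", True)]
overall, per_subset = subset_accuracy(demo)
```

Keeping subset labels attached to each graded answer is what makes the subspeciality- and modality-specific breakdowns in the Results section a one-pass computation.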