With the increasing attention to the state and role of people in intelligent manufacturing, there is a strong demand for human-cyber-physical systems (HCPS) that focus on human-robot interaction. The existing intellig...With the increasing attention to the state and role of people in intelligent manufacturing, there is a strong demand for human-cyber-physical systems (HCPS) that focus on human-robot interaction. The existing intelligent manufacturing system cannot satisfy efcient human-robot collaborative work. However, unlike machines equipped with sensors, human characteristic information is difcult to be perceived and digitized instantly. In view of the high complexity and uncertainty of the human body, this paper proposes a framework for building a human digital twin (HDT) model based on multimodal data and expounds on the key technologies. Data acquisition system is built to dynamically acquire and update the body state data and physiological data of the human body and realize the digital expression of multi-source heterogeneous human body information. A bidirectional long short-term memory and convolutional neural network (BiLSTM-CNN) based network is devised to fuse multimodal human data and extract the spatiotemporal features, and the human locomotion mode identifcation is taken as an application case. A series of optimization experiments are carried out to improve the performance of the proposed BiLSTM-CNN-based network model. The proposed model is compared with traditional locomotion mode identifcation models. The experimental results proved the superiority of the HDT framework for human locomotion mode identifcation.展开更多
In this case study, we hypothesized that sympathetic nerve activity would be higher during conversation with PALRO robot, and that conversation would result in an increase in cerebral blood flow near the Broca’s area...In this case study, we hypothesized that sympathetic nerve activity would be higher during conversation with PALRO robot, and that conversation would result in an increase in cerebral blood flow near the Broca’s area. The facial expressions of a human subject were recorded, and cerebral blood flow and heart rate variability were measured during interactions with the humanoid robot. These multimodal data were time-synchronized to quantitatively verify the change from the resting baseline by testing facial expression analysis, cerebral blood flow, and heart rate variability. In conclusion, this subject indicated that sympathetic nervous activity was dominant, suggesting that the subject may have enjoyed and been excited while talking to the robot (normalized High Frequency < normalized Low Frequency: 0.22 ± 0.16 < 0.78 ± 0.16). Cerebral blood flow values were higher during conversation and in the resting state after the experiment than in the resting state before the experiment. Talking increased cerebral blood flow in the frontal region. As the subject was left-handed, it was confirmed that the right side of the brain, where the Broca’s area is located, was particularly activated (Left < right: 0.15 ± 0.21 < 1.25 ± 0.17). In the sections where a “happy” facial emotion was recognized, the examiner-judged “happy” faces and the MTCNN “happy” results were also generally consistent.展开更多
Bone tumors(BTs)-including osteosarcoma,Ewing sarcoma,and chondrosarcoma-are rare but biologically complex malignancies characterized by pronounced heterogeneity in anatomical location,histological subtype,and molecul...Bone tumors(BTs)-including osteosarcoma,Ewing sarcoma,and chondrosarcoma-are rare but biologically complex malignancies characterized by pronounced heterogeneity in anatomical location,histological subtype,and molecular alterations.Recent advances in artificial intelligence(AI),particularly deep learning,have enabled the integration of diverse clinical data modalities to support diagnosis,treatment planning,and prognostication in bone oncology.This review provides a comprehensive synthesis of AI-driven multimodal fusion strategies that incorporate radiological imaging,digital pathology,multi-omics profiling,and electronic health records.We conducted a structured review of peer-reviewed literature published between 2015 and early 2025,focusing on the development,validation,and clinical applicability of AI models for BT diagnosis,subtyping,treatment response prediction,and recurrence monitoring.Although multimodal models have demonstrated advantages over unimodal approaches,especially in handling missing data and improving generalizability,most remain constrained by single-center study designs,small sample sizes,and limited prospective or external validation.Persistent technical and translational challenges include semantic misalignment across modalities,incomplete datasets,limited model interpretability,and regulatory and infrastructural barriers to clinical integration.To address these limitations,we highlight emerging directions such as contrastive representation learning,generative data augmentation,transformer-based fusion architectures,and privacy-preserving federated learning.We also discuss the evolving role of foundation models and workflow-integrated AI agents in enhancing scalability and clinical usability.In summary,multimodal AI represents a promising paradigm for advancing precision care in BTs.Realizing its full clinical potential will require methodologically rigorous,biologically informed,and system-level approaches that bridge algorithmic innovation with real-world healthcare delivery.展开更多
This paper addresses the challenge of efficiently querying multimodal related data in data lakes,a large-scale storage and management system that supports heterogeneous data formats,including structured,semi-structure...This paper addresses the challenge of efficiently querying multimodal related data in data lakes,a large-scale storage and management system that supports heterogeneous data formats,including structured,semi-structured,and unstructured data.Multimodal data queries are crucial because they enable seamless retrieval of related data across modalities,such as tables,images,and text,which has applications in fields like e-commerce,healthcare,and education.However,existing methods primarily focus on single-modality queries,such as joinable or unionable table discovery,and struggle to handle the heterogeneity and lack of metadata in data lakes while balancing accuracy and efficiency.To tackle these challenges,we propose a Multimodal data Query mechanism for Data Lakes(MQDL),which employs a modality-adaptive indexing mechanism raleted and contrastive learning based embeddings to unify representations across modalities.Additionally,we introduce product quantization to optimize candidate verification during queries,reducing computational overhead while maintaining precision.We evaluate MQDL using a table-image dataset across multiple business scenarios,measuring metrics such as precision,recall,and F1-score.Results show that MQDL achieves an accuracy rate of approximately 90%,while demonstrating strong scalability and reduced query response time compared to traditional methods.These findings highlight MQDL's potential to enhance multimodal data retrieval in complex data lake environments.展开更多
Seismic hazards pose a major threat to life safety,social development,and the economy.Traditional seismic vulnerability and risk assessments,such as field survey methods,may not be suitable for densely built-up urban ...Seismic hazards pose a major threat to life safety,social development,and the economy.Traditional seismic vulnerability and risk assessments,such as field survey methods,may not be suitable for densely built-up urban areas due to the limited availability of comprehensive data and potential subjectivity in judgment.To overcome these limitations,an integrated method for seismic vulnerability and risk assessment based on multimodal remote sensing data,support vector machine(SVM)and GIScience methods was proposed and applied to the central urban area of Jinan City,Shandong Province,China.First,an area with representative buildings was selected for field survey research,and an attribute information base established.Then,the SVM method was used to establish the susceptibility proxies,which were applied to the whole study area after accuracy evaluation.Finally,the spatial distribution of seismic vulnerability and risk under different seismic intensity scenarios(from VI to X)was analyzed in GIScience.The results show that the average building vulnerability index in the central urban area of Jinan City is 0.53,indicating that the overall seismic performance of buildings is at a moderate level.Under the seismic intensity scenario of VIII,the buildings in the Starting area and New urban district of Jinan would mostly suffer‘Moderate’damage,while Old urban areas,with more seismic-resistant buildings,would experience only‘Slight’damage.This study aims to offer an efficient and accurate method for assessing seismic vulnerability in mid to large-sized cities characterized by concentrated population densities and rapid urbanization,as well as provide a valuable reference for efforts in urban renewal,seismic mitigation,and land planning,particularly in cities and regions of developing countries.Additionally,it contributes to the realization of Sustainable Development Goal 11,which seeks to make cities and human settlements inclusive,safe,resilient,and sustainable.展开更多
Hateful meme is a multimodal medium that combines images and texts.The potential hate content of hateful memes has caused serious problems for social media security.The current hateful memes classification task faces ...Hateful meme is a multimodal medium that combines images and texts.The potential hate content of hateful memes has caused serious problems for social media security.The current hateful memes classification task faces significant data scarcity challenges,and direct fine-tuning of large-scale pre-trained models often leads to severe overfitting issues.In addition,it is a challenge to understand the underlying relationship between text and images in the hateful memes.To address these issues,we propose a multimodal hateful memes classification model named LABF,which is based on low-rank adapter layers and bidirectional gated feature fusion.Firstly,low-rank adapter layers are adopted to learn the feature representation of the new dataset.This is achieved by introducing a small number of additional parameters while retaining prior knowledge of the CLIP model,which effectively alleviates the overfitting phenomenon.Secondly,a bidirectional gated feature fusion mechanism is designed to dynamically adjust the interaction weights of text and image features to achieve finer cross-modal fusion.Experimental results show that the method significantly outperforms existing methods on two public datasets,verifying its effectiveness and robustness.展开更多
The coronavirus disease 2019 (COVID-19) pandemic has dramatically increased the awareness of emerging infectious diseases. The advancement of multiomics analysis technology has resulted in the development of several d...The coronavirus disease 2019 (COVID-19) pandemic has dramatically increased the awareness of emerging infectious diseases. The advancement of multiomics analysis technology has resulted in the development of several databases containing virus information. Several scientists have integrated existing data on viruses to construct phylogenetic trees and predict virus mutation and transmission in different ways, providing prospective technical support for epidemic prevention and control. This review summarized the databases of known emerging infectious viruses and techniques focusing on virus variant forecasting and early warning. It focuses on the multi-dimensional information integration and database construction of emerging infectious viruses, virus mutation spectrum construction and variant forecast model, analysis of the affinity between mutation antigen and the receptor, propagation model of virus dynamic evolution, and monitoring and early warning for variants. As people have suffered from COVID-19 and repeated flu outbreaks, we focused on the research results of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and influenza viruses. This review comprehensively viewed the latest virus research and provided a reference for future virus prevention and control research.展开更多
Artificial intelligence(AI)is driving a paradigm shift in gastroenterology and hepa-tology by delivering cutting-edge tools for disease screening,diagnosis,treatment,and prognostic management.Through deep learning,rad...Artificial intelligence(AI)is driving a paradigm shift in gastroenterology and hepa-tology by delivering cutting-edge tools for disease screening,diagnosis,treatment,and prognostic management.Through deep learning,radiomics,and multimodal data integration,AI has achieved diagnostic parity with expert cli-nicians in endoscopic image analysis(e.g.,early gastric cancer detection,colorectal polyp identification)and non-invasive assessment of liver pathologies(e.g.,fibrosis staging,fatty liver typing)while demonstrating utility in personalized care scenarios such as predicting hepatocellular carcinoma recurrence and opti-mizing inflammatory bowel disease treatment responses.Despite these advance-ments challenges persist including limited model generalization due to frag-mented datasets,algorithmic limitations in rare conditions(e.g.,pediatric liver diseases)caused by insufficient training data,and unresolved ethical issues related to bias,accountability,and patient privacy.Mitigation strategies involve constructing standardized multicenter databases,validating AI tools through prospective trials,leveraging federated learning to address data scarcity,and de-veloping interpretable systems(e.g.,attention heatmap visualization)to enhance clinical trust.Integrating generative AI,digital twin technologies,and establishing unified ethical/regulatory frameworks will accelerate AI adoption in primary care and foster equitable healthcare access while interdisciplinary collaboration and evidence-based implementation remain critical for realizing AI’s potential to redefine precision care for digestive disorders,improve global health outcomes,and reshape healthcare equity.展开更多
Population migration data derived from location-based services has often been used to delineate population flows between cities or construct intercity relationship networks to reveal and explore the complex interactio...Population migration data derived from location-based services has often been used to delineate population flows between cities or construct intercity relationship networks to reveal and explore the complex interaction patterns underlying human activities.Nevertheless,the inherent heterogeneity in multimodal migration big data has been ignored.This study conducts an in-depth comparison and quantitative analysis through a comprehensive lens of spatial association.Initially,the intercity interactive networks in China were constructed,utilizing migration data from Baidu and AutoNavi collected during the same time period.Subsequently,the characteristics and spatial structure similarities of the two types of intercity interactive networks were quantitatively assessed and analyzed from overall(network)and local(node)perspectives.Furthermore,the precision of these networks at the local scale is corroborated by constructing an intercity network from mobile phone(MP)data.Results indicate that the intercity interactive networks in China,as delineated by Baidu and AutoNavi migration flows,exhibit a high degree of structure equivalence.The correlation coefficient between these two networks is 0.874.Both networks exhibit a pronounced spatial polarization trend and hierarchical structure.This is evident in their distinct core and peripheral structures,as well as in the varying importance and influence of different nodes within the networks.Nevertheless,there are notable differences worthy of attention.Baidu intercity interactive network exhibits pronounced cross-regional effects,and its high-level interactions are characterized by a“rich-club”phenomenon.The AutoNavi intercity interactive network presents a more significant distance attenuation effect,and the high-level interactions display a gradient distribution pattern.Notably,there exists a substantial correlation between the AutoNavi and MP networks at the local scale,evidenced by a high correlation coefficient of 0.954.Furthermore,the“spatial dislocations”phenomenon was observed within the spatial structures at different levels,extracted from the Baidu and AutoNavi intercity networks.However,the measured results of network spatial structure similarity from three dimensions,namely,node location,node size,and local structure,indicate a relatively high similarity and consistency between the two networks.展开更多
Recent advances in computer vision and deep learning have shown that the fusion of depth information can significantly enhance the performance of RGB-based damage detection and segmentation models.However,alongside th...Recent advances in computer vision and deep learning have shown that the fusion of depth information can significantly enhance the performance of RGB-based damage detection and segmentation models.However,alongside the advantages,depth-sensing also presents many practical challenges.For instance,the depth sensors impose an additional payload burden on the robotic inspection platforms limiting the operation time and increasing the inspection cost.Additionally,some lidar-based depth sensors have poor outdoor performance due to sunlight contamination during the daytime.In this context,this study investigates the feasibility of abolishing depth-sensing at test time without compromising the segmentation performance.An autonomous damage segmentation framework is developed,based on recent advancements in vision-based multi-modal sensing such as modality hallucination(MH)and monocular depth estimation(MDE),which require depth data only during the model training.At the time of deployment,depth data becomes expendable as it can be simulated from the corresponding RGB frames.This makes it possible to reap the benefits of depth fusion without any depth perception per se.This study explored two different depth encoding techniques and three different fusion strategies in addition to a baseline RGB-based model.The proposed approach is validated on computer-generated RGB-D data of reinforced concrete buildings subjected to seismic damage.It was observed that the surrogate techniques can increase the segmentation IoU by up to 20.1%with a negligible increase in the computation cost.Overall,this study is believed to make a positive contribution to enhancing the resilience of critical civil infrastructure.展开更多
Psychological distress detection plays a critical role in modern healthcare,especially in ambient environments where continuous monitoring is essential for timely intervention.Advances in sensor technology and artific...Psychological distress detection plays a critical role in modern healthcare,especially in ambient environments where continuous monitoring is essential for timely intervention.Advances in sensor technology and artificial intelligence(AI)have enabled the development of systems capable of mental health monitoring using multimodal data.However,existing models often struggle with contextual adaptation and real-time decision-making in dynamic settings.This paper addresses these challenges by proposing TRANS-HEALTH,a hybrid framework that integrates transformer-based inference with Belief-Desire-Intention(BDI)reasoning for real-time psychological distress detection.The framework utilizes a multimodal dataset containing EEG,GSR,heart rate,and activity data to predict distress while adapting to individual contexts.The methodology combines deep learning for robust pattern recognition and symbolic BDI reasoning to enable adaptive decision-making.The novelty of the approach lies in its seamless integration of transformermodelswith BDI reasoning,providing both high accuracy and contextual relevance in real time.Performance metrics such as accuracy,precision,recall,and F1-score are employed to evaluate the system’s performance.The results show that TRANS-HEALTH outperforms existing models,achieving 96.1% accuracy with 4.78 ms latency and significantly reducing false alerts,with an enhanced ability to engage users,making it suitable for deployment in wearable and remote healthcare environments.展开更多
Industrial Internet of Things(IoT)connecting society and industrial systems represents a tremendous and promising paradigm shift.With IoT,multimodal and heterogeneous data from industrial devices can be easily collect...Industrial Internet of Things(IoT)connecting society and industrial systems represents a tremendous and promising paradigm shift.With IoT,multimodal and heterogeneous data from industrial devices can be easily collected,and further analyzed to discover device maintenance and health related potential knowledge behind.IoT data-based fault diagnosis for industrial devices is very helpful to the sustainability and applicability of an IoT ecosystem.But how to efficiently use and fuse this multimodal heterogeneous data to realize intelligent fault diagnosis is still a challenge.In this paper,a novel Deep Multimodal Learning and Fusion(DMLF)based fault diagnosis method is proposed for addressing heterogeneous data from IoT environments where industrial devices coexist.First,a DMLF model is designed by combining a Convolution Neural Network(CNN)and Stacked Denoising Autoencoder(SDAE)together to capture more comprehensive fault knowledge and extract features from different modal data.Second,these multimodal features are seamlessly integrated at a fusion layer and the resulting fused features are further used to train a classifier for recognizing potential faults.Third,a two-stage training algorithm is proposed by combining supervised pre-training and fine-tuning to simplify the training process for deep structure models.A series of experiments are conducted over multimodal heterogeneous data from a gear device to verify our proposed fault diagnosis method.The experimental results show that our method outperforms the benchmarking ones in fault diagnosis accuracy.展开更多
Efficiently predicting effluent quality through data-driven analysis presents a significant advancement for consistent wastewater treatment operations.In this study,we aimed to develop an integrated method for predict...Efficiently predicting effluent quality through data-driven analysis presents a significant advancement for consistent wastewater treatment operations.In this study,we aimed to develop an integrated method for predicting effluent COD and NH3 levels.We employed a 200 L pilot-scale sequencing batch reactor(SBR)to gather multimodal data from urban sewage over 40 d.Then we collected data on critical parameters like COD,DO,pH,NH_(3),EC,ORP,SS,and water temperature,alongside wastewater surface images,resulting in a data set of approximately 40246 points.Then we proposed a brain-inspired image and temporal fusion model integrated with a CNN-LSTM network(BITF-CL)using this data.This innovative model synergized sewage imagery with water quality data,enhancing prediction accuracy.As a result,the BITF-CL model reduced prediction error by over 23%compared to traditional methods and still performed comparably to conventional techniques even without using DO and SS sensor data.Consequently,this research presents a cost-effective and precise prediction system for sewage treatment,demonstrating the potential of brain-inspired models.展开更多
With the improvement of multisource information sensing and data acquisition capabilities inside tunnels,the availability of multimodal data in tunnel engineering has significantly increased.However,due to structural ...With the improvement of multisource information sensing and data acquisition capabilities inside tunnels,the availability of multimodal data in tunnel engineering has significantly increased.However,due to structural differences in multimodal data,traditional intelligent advanced geological prediction models have limited capacity for data fusion.Furthermore,the lack of pre-trained models makes it difficult for neural networks trained from scratch to deeply explore the features of multimodal data.To address these challenges,we utilize the fusion capability of knowledge graph for multimodal data and the pre-trained knowledge of large language models(LLMs)to establish an intelligent advanced geological prediction model(GeoPredict-LLM).First,we develop an advanced geological prediction ontology model,forming a knowledge graph database.Using knowledge graph embeddings,multisource and multimodal data are transformed into low-dimensional vectors with a unified structure.Secondly,pre-trained LLMs,through reprogramming,reconstruct these low-dimensional vectors,imparting linguistic characteristics to the data.This transformation effectively reframes the complex task of advanced geological prediction as a"language-based"problem,enabling the model to approach the task from a linguistic perspective.Moreover,we propose the prompt-as-prefix method,which enables output generation,while freezing the core of the LLM,thereby significantly reduces the number of training parameters.Finally,evaluations show that compared to neural network models without pre-trained models,GeoPredict-LLM significantly improves prediction accuracy.It is worth noting that as long as a knowledge graph database can be established,GeoPredict-LLM can be adapted to multimodal data mining tasks with minimal modifications.展开更多
Visual Question Answering(VQA)is an interdisciplinary artificial intelligence(AI)activity that integrates com-puter vision and natural language processing.Its purpose is to empower machines to respond to questions by ...Visual Question Answering(VQA)is an interdisciplinary artificial intelligence(AI)activity that integrates com-puter vision and natural language processing.Its purpose is to empower machines to respond to questions by utilizing visual information.A VQA system typically takes an image and a natural language query as input and produces a textual answer as output.One major obstacle in VQA is identifying a successful method to extract and merge textual and visual data.We examine“Fusion”Models that use information from both the text encoder and picture encoder to efficiently perform the visual question-answering challenge.For the transformer model,we utilize BERT and RoBERTa,which analyze textual data.The image encoder designed for processing image data utilizes ViT(Vision Transformer),Deit(Data-efficient Image Transformer),and BeIT(Image Transformers).The reasoning module of VQA was updated and layer normalization was incorporated to enhance the performance outcome of our effort.In comparison to the results of previous research,our proposed method suggests a substantial enhancement in efficacy.Our experiment obtained a 60.4%accuracy with the PathVQA dataset and a 69.2%accuracy with the VizWiz dataset.展开更多
In this paper,a meaningful representation of the road network using multiplex networks and a novel feature selection framework that enhances the predictability of future traffic conditions of an entire network are pro...In this paper,a meaningful representation of the road network using multiplex networks and a novel feature selection framework that enhances the predictability of future traffic conditions of an entire network are proposed.Using data on traffic volumes and tickets’validation from the transportation network of Athens,we were able to develop prediction models that not only achieve very good performance but are also trained efficiently,do not introduce high complexity and,thus,are suitable for real-time operation.More specifically,the network’s nodes(loop detectors and subway/metro stations)are organized as a multilayer graph,each layer representing an hour of the day.Nodes with similar structural properties are then classified in communities and are exploited as features to predict the future demand values of nodes belonging to the same community.The results reveal the potential of the proposed method to provide reliable and accurate predictions.展开更多
The recognition of dairy cow behavior is essential for enhancing health management,reproductive efficiency,production performance,and animal welfare.This paper addresses the challenge of modality loss in multimodal da...The recognition of dairy cow behavior is essential for enhancing health management,reproductive efficiency,production performance,and animal welfare.This paper addresses the challenge of modality loss in multimodal dairy cow behavior recognition algorithms,which can be caused by sensor or video signal disturbances arising from interference,harsh environmental conditions,extreme weather,network fluctuations,and other complexities inherent in farm environments.This study introduces a modality mapping completion network that maps incomplete sensor and video data to improve multimodal dairy cow behavior recognition under conditions of modality loss.By mapping incomplete sensor or video data,the method applies a multimodal behavior recognition algorithm to identify five specific behaviors:drinking,feeding,lying,standing,and walking.The results indicate that,under various comprehensive missing coefficients(λ),the method achieves an average accuracy of 97.87%±0.15%,an average precision of 95.19%±0.4%,and an average F1 score of 94.685%±0.375%,with an overall accuracy of 94.67%±0.37%.This approach enhances the robustness and applicability of cow behavior recognition based on multimodal data in situations of modality loss,resolving practical issues in the development of digital twins for cow behavior and providing comprehensive support for the intelligent and precise management of farms.展开更多
Skin diseases are important factors affecting health and quality of life,especially in rural areas where medical resources are limited.Early and accurate diagnosis can reduce unnecessary health and economic losses.How...Skin diseases are important factors affecting health and quality of life,especially in rural areas where medical resources are limited.Early and accurate diagnosis can reduce unnecessary health and economic losses.However,traditional visual diagnosis poses a high demand on both doctors’experience and the examination equipment,and there is a risk of missed diagnosis and misdiagnosis.Recently,advances in artificial intelligence technology,particularly deep learning,have resulted in the use of unimodal computer-aided diagnosis and treatment technologies based on skin images in dermatology.However,due to the small amount of information contained in unimodality,this technology cannot fully demonstrate the advantages of multimodal data in the real-world medical environment.Multimodal data fusion can fully integrate various types of data to help doctors make more accurate clinical decisions.This review aimed to provide a comprehensive overview of multimodal data and deep learning methods that could help dermatologists diagnose and treat skin diseases.展开更多
Nonalcoholic fatty liver disease(NAFLD)is the most common cause of chronic liver disease,and if it is accurately predicted,severe fibrosis and cirrhosis can be prevented.While liver biopsies,the gold standard for NAFL...Nonalcoholic fatty liver disease(NAFLD)is the most common cause of chronic liver disease,and if it is accurately predicted,severe fibrosis and cirrhosis can be prevented.While liver biopsies,the gold standard for NAFLD diagnosis,is intrusive,expensive,and prone to sample errors,noninvasive studies are extremely promising but are still in their infancy due to a dearth of comprehensive study data and sophisticated multimodal data methodologies.This paper proposes a novel approach for diagnosing NAFLD by integrating a comprehensive clinical dataset with a multimodal learning-based prediction method.The dataset comprises physical examinations,laboratory and imaging studies,detailed questionnaires,and facial photographs of a substantial number of participants,totaling more than 6000.This comprehensive collection of data holds significant value for clinical studies.The dataset is subjected to quantitative analysis to identify which clinical metadata,such as metadata and facial images,has the greatest impact on the prediction of NAFLD.Furthermore,a multimodal learning-based prediction method(DeepFLD)is proposed that incorporates several modalities and demonstrates superior performance compared to the methodology that relies only on metadata.Additionally,satisfactory performance is assessed through verification of the results using other unseen data.Inspiringly,the proposed DeepFLD prediction method can achieve competitive results by solely utilizing facial images as input rather than relying on metadata,paving the way for a more robust and simpler noninvasive NAFLD diagnosis.展开更多
With the rapid development of artificial intelligence(AI)technology,multimodal data integration has become an important means to improve the accuracy of diagnosis and treatment in gastroenterology and hepatology.This ...With the rapid development of artificial intelligence(AI)technology,multimodal data integration has become an important means to improve the accuracy of diagnosis and treatment in gastroenterology and hepatology.This article systematically reviews the latest progress of multimodal AI technology in the diagnosis,treatment,and decision-making for gastrointestinal tumors,functional gastrointestinal diseases,and liver diseases,focusing on the innovative applications of endoscopic image AI,pathological section AI,multi-omics data fusion models,and wearable devices combined with natural language processing.Multimodal AI can significantly improve the accuracy of early diagnosis and the efficiency of individualized treatment planning by integrating imaging,pathological data,molecular,and clinical phenotypic data.However,current AI technologies still face challenges such as insufficient data standardization,limited generalization of models,and ethical compliance.This paper proposes solutions,such as the establishment of cross-center data sharing platform,the development of federated learning framework,and the formulation of ethical norms,and looks forward to the application prospect of multimodal large-scale models in the disease management process.This review provides theoretical basis and practical guidance for promoting the clinical translation of AI technology in the field of gastroenterology and hepatology.展开更多
基金Supported by National Natural Science Foundation of China(Grant Nos.52205288,52130501,52075479)Zhejiang Provincial Key Research&Development Program(Grant No.2021C01110).
文摘With the increasing attention to the state and role of people in intelligent manufacturing, there is a strong demand for human-cyber-physical systems (HCPS) that focus on human-robot interaction. The existing intelligent manufacturing system cannot satisfy efcient human-robot collaborative work. However, unlike machines equipped with sensors, human characteristic information is difcult to be perceived and digitized instantly. In view of the high complexity and uncertainty of the human body, this paper proposes a framework for building a human digital twin (HDT) model based on multimodal data and expounds on the key technologies. Data acquisition system is built to dynamically acquire and update the body state data and physiological data of the human body and realize the digital expression of multi-source heterogeneous human body information. A bidirectional long short-term memory and convolutional neural network (BiLSTM-CNN) based network is devised to fuse multimodal human data and extract the spatiotemporal features, and the human locomotion mode identifcation is taken as an application case. A series of optimization experiments are carried out to improve the performance of the proposed BiLSTM-CNN-based network model. The proposed model is compared with traditional locomotion mode identifcation models. The experimental results proved the superiority of the HDT framework for human locomotion mode identifcation.
文摘In this case study, we hypothesized that sympathetic nerve activity would be higher during conversation with PALRO robot, and that conversation would result in an increase in cerebral blood flow near the Broca’s area. The facial expressions of a human subject were recorded, and cerebral blood flow and heart rate variability were measured during interactions with the humanoid robot. These multimodal data were time-synchronized to quantitatively verify the change from the resting baseline by testing facial expression analysis, cerebral blood flow, and heart rate variability. In conclusion, this subject indicated that sympathetic nervous activity was dominant, suggesting that the subject may have enjoyed and been excited while talking to the robot (normalized High Frequency < normalized Low Frequency: 0.22 ± 0.16 < 0.78 ± 0.16). Cerebral blood flow values were higher during conversation and in the resting state after the experiment than in the resting state before the experiment. Talking increased cerebral blood flow in the frontal region. As the subject was left-handed, it was confirmed that the right side of the brain, where the Broca’s area is located, was particularly activated (Left < right: 0.15 ± 0.21 < 1.25 ± 0.17). In the sections where a “happy” facial emotion was recognized, the examiner-judged “happy” faces and the MTCNN “happy” results were also generally consistent.
基金supported by the National Natural Science Foundation of China[Grant No.:82172524]the Natural Science Foundation of Hubei Province[Grant No.:2025AFB240].
文摘Bone tumors(BTs)-including osteosarcoma,Ewing sarcoma,and chondrosarcoma-are rare but biologically complex malignancies characterized by pronounced heterogeneity in anatomical location,histological subtype,and molecular alterations.Recent advances in artificial intelligence(AI),particularly deep learning,have enabled the integration of diverse clinical data modalities to support diagnosis,treatment planning,and prognostication in bone oncology.This review provides a comprehensive synthesis of AI-driven multimodal fusion strategies that incorporate radiological imaging,digital pathology,multi-omics profiling,and electronic health records.We conducted a structured review of peer-reviewed literature published between 2015 and early 2025,focusing on the development,validation,and clinical applicability of AI models for BT diagnosis,subtyping,treatment response prediction,and recurrence monitoring.Although multimodal models have demonstrated advantages over unimodal approaches,especially in handling missing data and improving generalizability,most remain constrained by single-center study designs,small sample sizes,and limited prospective or external validation.Persistent technical and translational challenges include semantic misalignment across modalities,incomplete datasets,limited model interpretability,and regulatory and infrastructural barriers to clinical integration.To address these limitations,we highlight emerging directions such as contrastive representation learning,generative data augmentation,transformer-based fusion architectures,and privacy-preserving federated learning.We also discuss the evolving role of foundation models and workflow-integrated AI agents in enhancing scalability and clinical usability.In summary,multimodal AI represents a promising paradigm for advancing precision care in BTs.Realizing its full clinical potential will require methodologically rigorous,biologically informed,and system-level approaches that bridge algorithmic innovation with real-world healthcare delivery.
文摘This paper addresses the challenge of efficiently querying multimodal related data in data lakes,a large-scale storage and management system that supports heterogeneous data formats,including structured,semi-structured,and unstructured data.Multimodal data queries are crucial because they enable seamless retrieval of related data across modalities,such as tables,images,and text,which has applications in fields like e-commerce,healthcare,and education.However,existing methods primarily focus on single-modality queries,such as joinable or unionable table discovery,and struggle to handle the heterogeneity and lack of metadata in data lakes while balancing accuracy and efficiency.To tackle these challenges,we propose a Multimodal data Query mechanism for Data Lakes(MQDL),which employs a modality-adaptive indexing mechanism raleted and contrastive learning based embeddings to unify representations across modalities.Additionally,we introduce product quantization to optimize candidate verification during queries,reducing computational overhead while maintaining precision.We evaluate MQDL using a table-image dataset across multiple business scenarios,measuring metrics such as precision,recall,and F1-score.Results show that MQDL achieves an accuracy rate of approximately 90%,while demonstrating strong scalability and reduced query response time compared to traditional methods.These findings highlight MQDL's potential to enhance multimodal data retrieval in complex data lake environments.
基金supported in part by the National Natural Science Foundation of China(Grant No.42201077)the Natural Science Foundation of Shandong Province(No.ZR2021QD074)+2 种基金the China Postdoctoral Science Foundation(No.2023M732105)the Lhasa National Geophysical Observation and Research Station(No.NORSLS22-05)the Youth Innovation Team Project of Higher School in Shandong Province,China(No.2024KJH087).
文摘Seismic hazards pose a major threat to life safety,social development,and the economy.Traditional seismic vulnerability and risk assessments,such as field survey methods,may not be suitable for densely built-up urban areas due to the limited availability of comprehensive data and potential subjectivity in judgment.To overcome these limitations,an integrated method for seismic vulnerability and risk assessment based on multimodal remote sensing data,support vector machine(SVM)and GIScience methods was proposed and applied to the central urban area of Jinan City,Shandong Province,China.First,an area with representative buildings was selected for field survey research,and an attribute information base established.Then,the SVM method was used to establish the susceptibility proxies,which were applied to the whole study area after accuracy evaluation.Finally,the spatial distribution of seismic vulnerability and risk under different seismic intensity scenarios(from VI to X)was analyzed in GIScience.The results show that the average building vulnerability index in the central urban area of Jinan City is 0.53,indicating that the overall seismic performance of buildings is at a moderate level.Under the seismic intensity scenario of VIII,the buildings in the Starting area and New urban district of Jinan would mostly suffer‘Moderate’damage,while Old urban areas,with more seismic-resistant buildings,would experience only‘Slight’damage.This study aims to offer an efficient and accurate method for assessing seismic vulnerability in mid to large-sized cities characterized by concentrated population densities and rapid urbanization,as well as provide a valuable reference for efforts in urban renewal,seismic mitigation,and land planning,particularly in cities and regions of developing countries.Additionally,it contributes to the realization of Sustainable Development Goal 11,which seeks to make cities and human settlements inclusive,safe,resilient,and sustainable.
基金supported by the Funding for Research on the Evolution of Cyberbullying Incidents and Intervention Strategies(24BSH033)Discipline Innovation and Talent Introduction Bases in Higher Education Institutions(B20087).
文摘Hateful meme is a multimodal medium that combines images and texts.The potential hate content of hateful memes has caused serious problems for social media security.The current hateful memes classification task faces significant data scarcity challenges,and direct fine-tuning of large-scale pre-trained models often leads to severe overfitting issues.In addition,it is a challenge to understand the underlying relationship between text and images in the hateful memes.To address these issues,we propose a multimodal hateful memes classification model named LABF,which is based on low-rank adapter layers and bidirectional gated feature fusion.Firstly,low-rank adapter layers are adopted to learn the feature representation of the new dataset.This is achieved by introducing a small number of additional parameters while retaining prior knowledge of the CLIP model,which effectively alleviates the overfitting phenomenon.Secondly,a bidirectional gated feature fusion mechanism is designed to dynamically adjust the interaction weights of text and image features to achieve finer cross-modal fusion.Experimental results show that the method significantly outperforms existing methods on two public datasets,verifying its effectiveness and robustness.
基金supported by the National Key R&D Program of China(2022YFF1203202,2018YFC2000205)the Strategic Priority Research Program of the Chinese Academy of Sciences(XDB38050200,XDA26040304)the Self-supporting Program of Guangzhou Laboratory(SRPG22-007).
文摘The coronavirus disease 2019 (COVID-19) pandemic has dramatically increased the awareness of emerging infectious diseases. The advancement of multiomics analysis technology has resulted in the development of several databases containing virus information. Several scientists have integrated existing data on viruses to construct phylogenetic trees and predict virus mutation and transmission in different ways, providing prospective technical support for epidemic prevention and control. This review summarized the databases of known emerging infectious viruses and techniques focusing on virus variant forecasting and early warning. It focuses on the multi-dimensional information integration and database construction of emerging infectious viruses, virus mutation spectrum construction and variant forecast model, analysis of the affinity between mutation antigen and the receptor, propagation model of virus dynamic evolution, and monitoring and early warning for variants. As people have suffered from COVID-19 and repeated flu outbreaks, we focused on the research results of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and influenza viruses. This review comprehensively viewed the latest virus research and provided a reference for future virus prevention and control research.
基金Supported by the Natural Science Foundation of Jilin Province,No.YDZJ202401182ZYTSJilin Provincial Key Laboratory of Precision Infectious Diseases,No.20200601011JCJilin Provincial Engineering Laboratory of Precision Prevention and Control for Common Diseases,Jilin Province Development and Reform Commission,No.2022C036.
文摘Artificial intelligence(AI)is driving a paradigm shift in gastroenterology and hepa-tology by delivering cutting-edge tools for disease screening,diagnosis,treatment,and prognostic management.Through deep learning,radiomics,and multimodal data integration,AI has achieved diagnostic parity with expert cli-nicians in endoscopic image analysis(e.g.,early gastric cancer detection,colorectal polyp identification)and non-invasive assessment of liver pathologies(e.g.,fibrosis staging,fatty liver typing)while demonstrating utility in personalized care scenarios such as predicting hepatocellular carcinoma recurrence and opti-mizing inflammatory bowel disease treatment responses.Despite these advance-ments challenges persist including limited model generalization due to frag-mented datasets,algorithmic limitations in rare conditions(e.g.,pediatric liver diseases)caused by insufficient training data,and unresolved ethical issues related to bias,accountability,and patient privacy.Mitigation strategies involve constructing standardized multicenter databases,validating AI tools through prospective trials,leveraging federated learning to address data scarcity,and de-veloping interpretable systems(e.g.,attention heatmap visualization)to enhance clinical trust.Integrating generative AI,digital twin technologies,and establishing unified ethical/regulatory frameworks will accelerate AI adoption in primary care and foster equitable healthcare access while interdisciplinary collaboration and evidence-based implementation remain critical for realizing AI’s potential to redefine precision care for digestive disorders,improve global health outcomes,and reshape healthcare equity.
基金National Natural Science Foundation of China,No.42361040。
文摘Population migration data derived from location-based services has often been used to delineate population flows between cities or construct intercity relationship networks to reveal and explore the complex interaction patterns underlying human activities.Nevertheless,the inherent heterogeneity in multimodal migration big data has been ignored.This study conducts an in-depth comparison and quantitative analysis through a comprehensive lens of spatial association.Initially,the intercity interactive networks in China were constructed,utilizing migration data from Baidu and AutoNavi collected during the same time period.Subsequently,the characteristics and spatial structure similarities of the two types of intercity interactive networks were quantitatively assessed and analyzed from overall(network)and local(node)perspectives.Furthermore,the precision of these networks at the local scale is corroborated by constructing an intercity network from mobile phone(MP)data.Results indicate that the intercity interactive networks in China,as delineated by Baidu and AutoNavi migration flows,exhibit a high degree of structure equivalence.The correlation coefficient between these two networks is 0.874.Both networks exhibit a pronounced spatial polarization trend and hierarchical structure.This is evident in their distinct core and peripheral structures,as well as in the varying importance and influence of different nodes within the networks.Nevertheless,there are notable differences worthy of attention.Baidu intercity interactive network exhibits pronounced cross-regional effects,and its high-level interactions are characterized by a“rich-club”phenomenon.The AutoNavi intercity interactive network presents a more significant distance attenuation effect,and the high-level interactions display a gradient distribution pattern.Notably,there exists a substantial correlation between the AutoNavi and MP networks at the local scale,evidenced by a high correlation coefficient of 0.954.Furthermore,the“spatial dislocations”phenomenon was observed within the spatial structures at different levels,extracted from the Baidu and AutoNavi intercity networks.However,the measured results of network spatial structure similarity from three dimensions,namely,node location,node size,and local structure,indicate a relatively high similarity and consistency between the two networks.
基金supported in part by a fund from Bentley Systems,Inc.
文摘Recent advances in computer vision and deep learning have shown that the fusion of depth information can significantly enhance the performance of RGB-based damage detection and segmentation models.However,alongside the advantages,depth-sensing also presents many practical challenges.For instance,the depth sensors impose an additional payload burden on the robotic inspection platforms limiting the operation time and increasing the inspection cost.Additionally,some lidar-based depth sensors have poor outdoor performance due to sunlight contamination during the daytime.In this context,this study investigates the feasibility of abolishing depth-sensing at test time without compromising the segmentation performance.An autonomous damage segmentation framework is developed,based on recent advancements in vision-based multi-modal sensing such as modality hallucination(MH)and monocular depth estimation(MDE),which require depth data only during the model training.At the time of deployment,depth data becomes expendable as it can be simulated from the corresponding RGB frames.This makes it possible to reap the benefits of depth fusion without any depth perception per se.This study explored two different depth encoding techniques and three different fusion strategies in addition to a baseline RGB-based model.The proposed approach is validated on computer-generated RGB-D data of reinforced concrete buildings subjected to seismic damage.It was observed that the surrogate techniques can increase the segmentation IoU by up to 20.1%with a negligible increase in the computation cost.Overall,this study is believed to make a positive contribution to enhancing the resilience of critical civil infrastructure.
基金funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2025R435),Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.
文摘Psychological distress detection plays a critical role in modern healthcare,especially in ambient environments where continuous monitoring is essential for timely intervention.Advances in sensor technology and artificial intelligence(AI)have enabled the development of systems capable of mental health monitoring using multimodal data.However,existing models often struggle with contextual adaptation and real-time decision-making in dynamic settings.This paper addresses these challenges by proposing TRANS-HEALTH,a hybrid framework that integrates transformer-based inference with Belief-Desire-Intention(BDI)reasoning for real-time psychological distress detection.The framework utilizes a multimodal dataset containing EEG,GSR,heart rate,and activity data to predict distress while adapting to individual contexts.The methodology combines deep learning for robust pattern recognition and symbolic BDI reasoning to enable adaptive decision-making.The novelty of the approach lies in its seamless integration of transformermodelswith BDI reasoning,providing both high accuracy and contextual relevance in real time.Performance metrics such as accuracy,precision,recall,and F1-score are employed to evaluate the system’s performance.The results show that TRANS-HEALTH outperforms existing models,achieving 96.1% accuracy with 4.78 ms latency and significantly reducing false alerts,with an enhanced ability to engage users,making it suitable for deployment in wearable and remote healthcare environments.
基金supported in part by the National Key Research and Development Program of China(No.2018YFB1003700)in part by the National Natural Science Foundation of China(No.61836001)。
文摘Industrial Internet of Things(IoT)connecting society and industrial systems represents a tremendous and promising paradigm shift.With IoT,multimodal and heterogeneous data from industrial devices can be easily collected,and further analyzed to discover device maintenance and health related potential knowledge behind.IoT data-based fault diagnosis for industrial devices is very helpful to the sustainability and applicability of an IoT ecosystem.But how to efficiently use and fuse this multimodal heterogeneous data to realize intelligent fault diagnosis is still a challenge.In this paper,a novel Deep Multimodal Learning and Fusion(DMLF)based fault diagnosis method is proposed for addressing heterogeneous data from IoT environments where industrial devices coexist.First,a DMLF model is designed by combining a Convolution Neural Network(CNN)and Stacked Denoising Autoencoder(SDAE)together to capture more comprehensive fault knowledge and extract features from different modal data.Second,these multimodal features are seamlessly integrated at a fusion layer and the resulting fused features are further used to train a classifier for recognizing potential faults.Third,a two-stage training algorithm is proposed by combining supervised pre-training and fine-tuning to simplify the training process for deep structure models.A series of experiments are conducted over multimodal heterogeneous data from a gear device to verify our proposed fault diagnosis method.The experimental results show that our method outperforms the benchmarking ones in fault diagnosis accuracy.
基金supported by the National Key R&D Program of China(No.2021YFC1809001).
文摘Efficiently predicting effluent quality through data-driven analysis presents a significant advancement for consistent wastewater treatment operations.In this study,we aimed to develop an integrated method for predicting effluent COD and NH3 levels.We employed a 200 L pilot-scale sequencing batch reactor(SBR)to gather multimodal data from urban sewage over 40 d.Then we collected data on critical parameters like COD,DO,pH,NH_(3),EC,ORP,SS,and water temperature,alongside wastewater surface images,resulting in a data set of approximately 40246 points.Then we proposed a brain-inspired image and temporal fusion model integrated with a CNN-LSTM network(BITF-CL)using this data.This innovative model synergized sewage imagery with water quality data,enhancing prediction accuracy.As a result,the BITF-CL model reduced prediction error by over 23%compared to traditional methods and still performed comparably to conventional techniques even without using DO and SS sensor data.Consequently,this research presents a cost-effective and precise prediction system for sewage treatment,demonstrating the potential of brain-inspired models.
基金the National Natural Science Foundation of China(Grant Nos.52279103 and 52379103)。
文摘With the improvement of multisource information sensing and data acquisition capabilities inside tunnels,the availability of multimodal data in tunnel engineering has significantly increased.However,due to structural differences in multimodal data,traditional intelligent advanced geological prediction models have limited capacity for data fusion.Furthermore,the lack of pre-trained models makes it difficult for neural networks trained from scratch to deeply explore the features of multimodal data.To address these challenges,we utilize the fusion capability of knowledge graph for multimodal data and the pre-trained knowledge of large language models(LLMs)to establish an intelligent advanced geological prediction model(GeoPredict-LLM).First,we develop an advanced geological prediction ontology model,forming a knowledge graph database.Using knowledge graph embeddings,multisource and multimodal data are transformed into low-dimensional vectors with a unified structure.Secondly,pre-trained LLMs,through reprogramming,reconstruct these low-dimensional vectors,imparting linguistic characteristics to the data.This transformation effectively reframes the complex task of advanced geological prediction as a"language-based"problem,enabling the model to approach the task from a linguistic perspective.Moreover,we propose the prompt-as-prefix method,which enables output generation,while freezing the core of the LLM,thereby significantly reduces the number of training parameters.Finally,evaluations show that compared to neural network models without pre-trained models,GeoPredict-LLM significantly improves prediction accuracy.It is worth noting that as long as a knowledge graph database can be established,GeoPredict-LLM can be adapted to multimodal data mining tasks with minimal modifications.
基金supported by the National Science and Technology Council,Taiwan(Grant number:NSTC 111-2637-H-324-001-).
文摘Visual Question Answering(VQA)is an interdisciplinary artificial intelligence(AI)activity that integrates com-puter vision and natural language processing.Its purpose is to empower machines to respond to questions by utilizing visual information.A VQA system typically takes an image and a natural language query as input and produces a textual answer as output.One major obstacle in VQA is identifying a successful method to extract and merge textual and visual data.We examine“Fusion”Models that use information from both the text encoder and picture encoder to efficiently perform the visual question-answering challenge.For the transformer model,we utilize BERT and RoBERTa,which analyze textual data.The image encoder designed for processing image data utilizes ViT(Vision Transformer),Deit(Data-efficient Image Transformer),and BeIT(Image Transformers).The reasoning module of VQA was updated and layer normalization was incorporated to enhance the performance outcome of our effort.In comparison to the results of previous research,our proposed method suggests a substantial enhancement in efficacy.Our experiment obtained a 60.4%accuracy with the PathVQA dataset and a 69.2%accuracy with the VizWiz dataset.
文摘In this paper,a meaningful representation of the road network using multiplex networks and a novel feature selection framework that enhances the predictability of future traffic conditions of an entire network are proposed.Using data on traffic volumes and tickets’validation from the transportation network of Athens,we were able to develop prediction models that not only achieve very good performance but are also trained efficiently,do not introduce high complexity and,thus,are suitable for real-time operation.More specifically,the network’s nodes(loop detectors and subway/metro stations)are organized as a multilayer graph,each layer representing an hour of the day.Nodes with similar structural properties are then classified in communities and are exploited as features to predict the future demand values of nodes belonging to the same community.The results reveal the potential of the proposed method to provide reliable and accurate predictions.
基金supported by the National Key Research and Development Program of China(Grand No.2023YFD2000700)“Supported by the earmarked fund for CARS(CARS-36)”the Key Research and Development Program of Heilongjiang Province of China(Grant No.2022ZX01A24).
文摘The recognition of dairy cow behavior is essential for enhancing health management,reproductive efficiency,production performance,and animal welfare.This paper addresses the challenge of modality loss in multimodal dairy cow behavior recognition algorithms,which can be caused by sensor or video signal disturbances arising from interference,harsh environmental conditions,extreme weather,network fluctuations,and other complexities inherent in farm environments.This study introduces a modality mapping completion network that maps incomplete sensor and video data to improve multimodal dairy cow behavior recognition under conditions of modality loss.By mapping incomplete sensor or video data,the method applies a multimodal behavior recognition algorithm to identify five specific behaviors:drinking,feeding,lying,standing,and walking.The results indicate that,under various comprehensive missing coefficients(λ),the method achieves an average accuracy of 97.87%±0.15%,an average precision of 95.19%±0.4%,and an average F1 score of 94.685%±0.375%,with an overall accuracy of 94.67%±0.37%.This approach enhances the robustness and applicability of cow behavior recognition based on multimodal data in situations of modality loss,resolving practical issues in the development of digital twins for cow behavior and providing comprehensive support for the intelligent and precise management of farms.
基金supported by grants from the Air Force Logistics Department(Grant No.BKJW221J009).
文摘Skin diseases are important factors affecting health and quality of life,especially in rural areas where medical resources are limited.Early and accurate diagnosis can reduce unnecessary health and economic losses.However,traditional visual diagnosis poses a high demand on both doctors’experience and the examination equipment,and there is a risk of missed diagnosis and misdiagnosis.Recently,advances in artificial intelligence technology,particularly deep learning,have resulted in the use of unimodal computer-aided diagnosis and treatment technologies based on skin images in dermatology.However,due to the small amount of information contained in unimodality,this technology cannot fully demonstrate the advantages of multimodal data in the real-world medical environment.Multimodal data fusion can fully integrate various types of data to help doctors make more accurate clinical decisions.This review aimed to provide a comprehensive overview of multimodal data and deep learning methods that could help dermatologists diagnose and treat skin diseases.
文摘Nonalcoholic fatty liver disease(NAFLD)is the most common cause of chronic liver disease,and if it is accurately predicted,severe fibrosis and cirrhosis can be prevented.While liver biopsies,the gold standard for NAFLD diagnosis,is intrusive,expensive,and prone to sample errors,noninvasive studies are extremely promising but are still in their infancy due to a dearth of comprehensive study data and sophisticated multimodal data methodologies.This paper proposes a novel approach for diagnosing NAFLD by integrating a comprehensive clinical dataset with a multimodal learning-based prediction method.The dataset comprises physical examinations,laboratory and imaging studies,detailed questionnaires,and facial photographs of a substantial number of participants,totaling more than 6000.This comprehensive collection of data holds significant value for clinical studies.The dataset is subjected to quantitative analysis to identify which clinical metadata,such as metadata and facial images,has the greatest impact on the prediction of NAFLD.Furthermore,a multimodal learning-based prediction method(DeepFLD)is proposed that incorporates several modalities and demonstrates superior performance compared to the methodology that relies only on metadata.Additionally,satisfactory performance is assessed through verification of the results using other unseen data.Inspiringly,the proposed DeepFLD prediction method can achieve competitive results by solely utilizing facial images as input rather than relying on metadata,paving the way for a more robust and simpler noninvasive NAFLD diagnosis.
文摘With the rapid development of artificial intelligence(AI)technology,multimodal data integration has become an important means to improve the accuracy of diagnosis and treatment in gastroenterology and hepatology.This article systematically reviews the latest progress of multimodal AI technology in the diagnosis,treatment,and decision-making for gastrointestinal tumors,functional gastrointestinal diseases,and liver diseases,focusing on the innovative applications of endoscopic image AI,pathological section AI,multi-omics data fusion models,and wearable devices combined with natural language processing.Multimodal AI can significantly improve the accuracy of early diagnosis and the efficiency of individualized treatment planning by integrating imaging,pathological data,molecular,and clinical phenotypic data.However,current AI technologies still face challenges such as insufficient data standardization,limited generalization of models,and ethical compliance.This paper proposes solutions,such as the establishment of cross-center data sharing platform,the development of federated learning framework,and the formulation of ethical norms,and looks forward to the application prospect of multimodal large-scale models in the disease management process.This review provides theoretical basis and practical guidance for promoting the clinical translation of AI technology in the field of gastroenterology and hepatology.