Funding: Supported by the National Natural Science Foundation of China (No. U24A20256) and the Science and Technology Major Project of Changsha (No. kh2402004).
Abstract: Predicting mortality risk in the Intensive Care Unit (ICU) using Electronic Medical Records (EMR) is crucial for identifying patients in need of immediate attention. However, the incompleteness and variability of EMR features across patients make mortality prediction challenging. This study proposes a multimodal representation learning framework built on a novel personalized graph-based fusion approach to address these challenges. The approach constructs patient-specific modality aggregation graphs that describe the features available for each patient in the incomplete multimodal data, enabling effective and explainable fusion of the incomplete features. Modality-specific encoders encode each modality separately, and the personalized graph-based fusion method then fuses the patient-specific multimodal feature representations according to the constructed modality aggregation graphs, handling the variability and incompleteness of input features among patients. Furthermore, a MultiModal Gated Contrastive Representation Learning (MMGCRL) method is proposed to help capture complementary information across the multimodal representations and improve model performance. We evaluate the framework on the large-scale ICU dataset MIMIC-III. Experimental results demonstrate its effectiveness for mortality prediction, outperforming several state-of-the-art methods.
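To make the fusion idea more concrete, the following is a minimal PyTorch sketch of how patient-specific modality aggregation could be realized, assuming each available modality has already been encoded into a fixed-size vector by its modality-specific encoder. The class name ModalityGraphFusion, all shapes, and the attention-style aggregation are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: fuses per-modality embeddings with a patient-specific
# availability mask, so missing modalities contribute nothing to the fused vector.
import torch
import torch.nn as nn


class ModalityGraphFusion(nn.Module):
    """Hypothetical graph-style fusion over the modalities observed for a patient."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # edge score between modality pairs
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feats: (batch, M, dim) modality embeddings; mask: (batch, M), 1 = observed
        B, M, D = feats.shape
        pairs = torch.cat(
            [feats.unsqueeze(2).expand(B, M, M, D),
             feats.unsqueeze(1).expand(B, M, M, D)], dim=-1)
        logits = self.score(pairs).squeeze(-1)          # (B, M, M) edge scores
        # keep only edges whose endpoints are both observed for this patient
        adj = mask.unsqueeze(1) * mask.unsqueeze(2)
        logits = logits.masked_fill(adj == 0, float("-inf"))
        attn = torch.nan_to_num(torch.softmax(logits, dim=-1))  # unobserved rows -> 0
        fused = self.proj(attn @ feats)                 # aggregate neighbours per node
        # patient-level representation: mean over the observed modality nodes
        denom = mask.sum(dim=1, keepdim=True).clamp(min=1)
        return (fused * mask.unsqueeze(-1)).sum(dim=1) / denom
```

In this sketch the mask plays the role of the personalized graph: each patient's fused representation is computed only from the modalities that are actually present for that patient.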
Abstract: Visual question answering (VQA) is a multimodal task that involves a deep understanding of the image scene and the question's meaning, and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; (2) studying the role of the multimodal bilinear pooling fusion technique in the trade-off between model complexity and overall model performance. Some fusion techniques can significantly increase model complexity, which seriously limits their applicability to VQA models. So far, there has been no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted among eight bilinear pooling fusion techniques in terms of their ability to reduce model complexity and improve model performance for this class of VQA system. Experiments indicate that these multimodal bilinear pooling fusion techniques improve the VQA model's performance, reaching a best performance of 89.25%. Further, experiments show that the number of answers in the developed VQA system is a critical factor that affects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique shows the best balance between model complexity and performance for VQA systems designed to answer yes/no questions.
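For readers unfamiliar with bilinear pooling fusion, the sketch below illustrates a generic low-rank bilinear fusion layer in PyTorch (in the spirit of MLB-style pooling); it is not the MLPB technique or the authors' code. The 2048-dimensional image vector matches the usual ResNet-152 pooled feature, while the question dimension, rank, and dropout rate are assumptions.

```python
# Minimal sketch of low-rank multimodal bilinear fusion for a yes/no VQA head.
import torch
import torch.nn as nn


class LowRankBilinearFusion(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, rank=1200, num_answers=2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, rank)
        self.q_proj = nn.Linear(q_dim, rank)
        self.drop = nn.Dropout(0.3)
        self.classifier = nn.Linear(rank, num_answers)  # yes/no -> 2 answers

    def forward(self, img_feat, q_feat):
        # The Hadamard product of the projected features approximates a full
        # bilinear interaction at a fraction of its parameter count.
        joint = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q_feat))
        return self.classifier(self.drop(joint))


# usage: logits = LowRankBilinearFusion()(torch.randn(8, 2048), torch.randn(8, 1024))
```

The point of such low-rank schemes is exactly the trade-off the abstract studies: a full bilinear map over a 2048-d image feature and a 1024-d question feature would be prohibitively large, so the fusion is factorized into two projections and an element-wise product.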
Funding: This work is supported by the National Key Research and Development Program of China (2018YFB1800202, 2016YFB1000302, SQ2019ZD090149, 2018YFB0204301).
Abstract: By efficiently and accurately predicting the adoptability of pets, shelters and rescuers can be guided toward improving the attractiveness of pet profiles, reducing animal suffering and euthanization. Previous prediction methods usually used only a single type of content for training. However, many pet profiles contain not only textual content but also images. To make full use of textual and visual information, this paper proposes a novel method for processing pet profiles that contain multimodal information. We employ several CNN (Convolutional Neural Network)-based models and other methods to extract features from images and texts to obtain the initial multimodal representation, then reduce their dimensionality and fuse them. Finally, we train the fused features with two GBDT (Gradient Boosting Decision Tree)-based models and a Neural Network (NN), and compare their individual performance with that of their ensemble. The evaluation results demonstrate that the proposed ensemble learning improves prediction accuracy.
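A hedged sketch of the overall recipe described above — per-modality dimensionality reduction, concatenation-based fusion, and a soft-voting ensemble of a GBDT model and a neural network — is given below using scikit-learn with synthetic data. The feature shapes, reduction method, and model hyperparameters are assumptions for illustration, not the paper's settings.

```python
# Illustrative pipeline: pre-extracted CNN image features and text features are
# reduced, concatenated, and fed to a GBDT and a small neural network whose
# predicted probabilities are averaged (a simple soft-voting ensemble).
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
img_feats = rng.normal(size=(500, 2048))   # stand-in for CNN image embeddings
txt_feats = rng.normal(size=(500, 300))    # stand-in for text embeddings
labels = rng.integers(0, 2, size=500)      # adoptability label (binary for brevity)

# reduce each modality separately, then fuse by concatenation
fused = np.hstack([
    TruncatedSVD(n_components=64, random_state=0).fit_transform(img_feats),
    TruncatedSVD(n_components=32, random_state=0).fit_transform(txt_feats),
])

gbdt = GradientBoostingClassifier().fit(fused, labels)
nn_clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(fused, labels)

# ensemble: average the predicted probabilities of both models
ensemble_proba = (gbdt.predict_proba(fused) + nn_clf.predict_proba(fused)) / 2
ensemble_pred = ensemble_proba.argmax(axis=1)
```

In practice the reduced features would be split into training and test sets, and the ensemble weights could be tuned on a validation fold rather than averaged equally.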