The application of fusion technology is of considerable importance in the field of multi-modal viewport prediction. The latest attention-based fusion methods have been shown to perform well in prediction accuracy. However, these methods fail to account for the differing information density of the three modalities involved in viewport prediction: trajectory, visual, and audio. The visual and audio modalities carry low-level signal information, whereas the trajectory modality carries high-level time-series information. In this paper, a viewport prediction framework based on a Modality Diversity-Aware (MDA) fusion network is proposed to achieve multi-modal feature interaction. First, we design a fusion module that combines the visual and audio modalities, strengthening them into high-level complementary features. Then, we apply cross-modal attention to tightly integrate the fused visual-audio information with the trajectory features. Our method addresses the mismatch in information density among the three modalities, ensuring a fair and effective interaction between them. To evaluate the proposed approach, we conducted experiments on a widely used public dataset. The experiments demonstrate that our approach predicts viewport areas accurately while significantly reducing the number of model parameters.
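The following is a minimal PyTorch sketch of the two-stage interaction described above: visual and audio features are first fused into a high-level representation, which then interacts with the trajectory features via cross-modal attention. The module names (`VisualAudioFusion`, `MDAFusion`), the feature dimensions, and the concatenation-based fusion are illustrative assumptions; the abstract does not specify the paper's exact layer design.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: layer choices and dimensions are assumptions,
# not the paper's exact architecture.

class VisualAudioFusion(nn.Module):
    """Fuses low-level visual and audio features into a single high-level
    complementary representation (assumed: concatenation + linear projection)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, seq_len, dim)
        return self.proj(torch.cat([visual, audio], dim=-1))


class MDAFusion(nn.Module):
    """Cross-modal attention stage: the high-level trajectory features act as
    queries over the fused visual-audio features, so both streams carry
    comparably dense information before they interact."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.va_fusion = VisualAudioFusion(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, trajectory, visual, audio):
        va = self.va_fusion(visual, audio)               # high-level visual-audio features
        attended, _ = self.cross_attn(query=trajectory,  # trajectory attends to fused features
                                      key=va, value=va)
        return self.norm(trajectory + attended)          # residual connection + layer norm


# Usage: the fused output would feed a prediction head for future viewports.
fusion = MDAFusion(dim=128)
traj = torch.randn(2, 10, 128)  # trajectory features (batch, time, dim)
vis = torch.randn(2, 10, 128)   # visual features
aud = torch.randn(2, 10, 128)   # audio features
out = fusion(traj, vis, aud)    # -> (2, 10, 128)
```

In this sketch, fusing visual and audio first mirrors the abstract's reasoning: both are low-level signal modalities, so combining them yields a high-level representation that can interact with the trajectory stream on more equal footing.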