Fusion technology is of considerable importance in multi-modal viewport prediction. The latest attention-based fusion methods have been shown to achieve good prediction accuracy. However, these methods fail to account for the differing density of information among the three modalities involved in viewport prediction: trajectory, visual, and audio. The visual and audio modalities carry primitive signal information, while the trajectory modality carries higher-level time-series information. In this paper, a viewport prediction framework based on a Modality Diversity-Aware (MDA) fusion network is proposed to achieve multi-modal feature interaction. First, we design a fusion module that combines the visual and auditory modalities, strengthening them as higher-level complementary features. Subsequently, we use cross-modal attention to integrate the fused visual-audio information with the trajectory features. Our method addresses the issue of differing information densities among the three modalities, ensuring a fair and effective interaction between them. To evaluate the efficacy of the proposed approach, we conducted experiments on a widely used public dataset. The experiments demonstrate that our approach predicts accurate viewport areas with a significant decrease in model parameters.
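The two-stage idea in this abstract (fuse visual and audio first, then let trajectory features attend to the fused result) can be illustrated with a minimal sketch. This is not the MDA network itself: the abstract does not specify the fusion module, so the element-wise mean in `fuse` is a stand-in assumption, the attention has no learned projections, and all function names and feature shapes are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(visual, audio):
    # Stand-in fusion (assumption): element-wise mean of aligned
    # visual and audio feature vectors, one pair per time step.
    return [[(v + a) / 2.0 for v, a in zip(vrow, arow)]
            for vrow, arow in zip(visual, audio)]

def cross_modal_attention(queries, keys, values):
    # Scaled dot-product attention: each query (trajectory step)
    # attends over all keys (fused visual-audio steps).
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy features: 2 trajectory steps, 3 visual/audio steps, dimension 2.
trajectory = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [0.0, 1.0], [1.0, 0.0]]

fused = fuse(visual, audio)
attended, attn = cross_modal_attention(trajectory, fused, fused)
```

Here `attended` holds one fused-feature summary per trajectory step, and each row of `attn` is a probability distribution over the fused visual-audio steps.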
Accurate viewport prediction is crucial for enhancing user experience in 360-degree video streaming. However, due to significant behavioral differences among user groups, traditional single LSTM models tend to fall into local optima and fail to achieve precise predictions. To address this, this paper proposes a hybrid prediction model based on user clustering. First, the density-based clustering algorithm DBSCAN is used to group users with similar behavioral patterns. Then, a hybrid prediction model combining Generative Adversarial Networks (GANs) and Long Short-Term Memory networks (LSTMs) is designed to mitigate data imbalance and overfitting through collaborative training. Experiments conducted on three real-world datasets from YouTube demonstrate that this approach significantly outperforms existing methods based on user trajectories or video saliency in terms of prediction accuracy and stability.
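The user-grouping step can be sketched with a small, self-contained DBSCAN. This is only an illustration of the clustering stage, not the paper's pipeline: the per-user features (here, a hypothetical 2-D summary such as mean yaw and pitch in degrees), the `eps` and `min_pts` values, and the function names are all assumptions, and the GAN/LSTM predictor is not shown.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    # Indices of all points within distance eps of point i (including i).
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

def dbscan(points, eps, min_pts):
    # Returns one label per point: 0, 1, ... for clusters, -1 for noise.
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Hypothetical per-user features: (mean yaw, mean pitch) in degrees.
users = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(users, eps=1.5, min_pts=3)
```

Users that share a label would then be modeled together, while users labeled `-1` fall outside every dense group and would need separate handling.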