Abstract
Predicting the popularity of content on social media platforms such as Weibo is of significant practical value and has attracted widespread attention. Taking into account the heterogeneity among information propagation patterns, this paper proposes a Weibo popularity prediction algorithm based on propagation pattern clustering and XGBoost. In the data preprocessing stage, the K-means++ clustering algorithm groups Weibo posts by propagation pattern, using the sequence of retweet increments over equal time intervals within the early retweet window, which yields a training subset for each propagation pattern. In the feature extraction stage, four temporal features are extracted: the average time interval between adjacent retweets, the time interval from the original post to its first retweet, the cumulative popularity sequence over equal time intervals, and the popularity increment sequence. Four structural features are also extracted: the number of first-order neighbors of the original poster, the number of leaf nodes among the retweets, the popularity within the observation window, and the average depth of the retweet paths. The two categories of features are concatenated to form each sample's feature vector. In the offline training stage, regression is performed with the XGBoost ensemble learning framework, yielding a separate popularity prediction model for each subset. Finally, experiments on a Sina Weibo retweet dataset verify the effectiveness of the proposed algorithm on the MSLE and mSLE metrics.
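The temporal features and evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the choice of window length and bin count, the use of log(1 + x) in the error, and the reading of mSLE as the median (rather than mean) squared log error, a convention common in popularity prediction work that the abstract itself does not spell out. Retweet times are taken as seconds elapsed since the original post.

```python
import math
from statistics import mean, median


def increment_sequence(retweet_times, window, n_bins):
    """Count retweets in each of n_bins equal-width bins covering [0, window)."""
    width = window / n_bins
    counts = [0] * n_bins
    for t in retweet_times:
        if 0 <= t < window:
            # min() guards against floating-point edge cases near the window end.
            counts[min(int(t // width), n_bins - 1)] += 1
    return counts


def cumulative_sequence(increments):
    """Running total of an increment sequence (popularity accumulated so far)."""
    total, out = 0, []
    for c in increments:
        total += c
        out.append(total)
    return out


def temporal_features(retweet_times, window, n_bins):
    """Hypothetical temporal feature set built from retweets inside the window."""
    observed = sorted(t for t in retweet_times if 0 <= t < window)
    inc = increment_sequence(observed, window, n_bins)
    gaps = [b - a for a, b in zip(observed, observed[1:])]
    return {
        # Assumption: fall back to the window length when a value is undefined.
        "mean_gap": mean(gaps) if gaps else float(window),
        "first_delay": observed[0] if observed else float(window),
        "increments": inc,
        "cumulative": cumulative_sequence(inc),
    }


def squared_log_errors(y_true, y_pred):
    """Per-sample squared error between log(1 + prediction) and log(1 + truth)."""
    return [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]


def msle(y_true, y_pred):
    """Mean squared log error."""
    return mean(squared_log_errors(y_true, y_pred))


def median_sle(y_true, y_pred):
    """Median squared log error (assumed meaning of mSLE)."""
    return median(squared_log_errors(y_true, y_pred))
```

The increment sequence produced this way is the kind of vector that could be passed to a K-means++ clustering step, while the full feature dictionary, flattened and joined with structural features, would feed a regressor. The median variant is less sensitive than the mean to a few posts with very large errors, which is one reason the two metrics are often reported together.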
Authors
HUANG Dewei (黄德伟); WANG Youguo (王友国); HOU Haojie (侯浩杰)
School of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Source
Intelligent Computer and Applications (《智能计算机与应用》)
2025, No. 5, pp. 21-27 (7 pages)
Funding
National Natural Science Foundation of China (62071248).