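Opinion analysis is of practical importance for social media, a key arena of online public opinion. Building on nonparametric text clustering, this paper groups social media texts by the opinions their authors express, giving a direct view of the different stances held across a user population. Because social media texts are short, numerous, and rich in sentiment, the paper proposes a Sentiment Distribution Enhanced (SDE) method that improves existing short-text stream clustering algorithms based on the Dirichlet process mixture model: text sentiment is modeled with a Gaussian distribution, and a corresponding collapsed Gibbs sampling algorithm is derived for parameter inference. The method captures the sentiment features of texts while automatically determining the number of clusters and performing opinion clustering. In comparative experiments against existing state-of-the-art methods on the Tweets and Google News datasets, the proposed method outperforms existing models on metrics such as normalized mutual information and accuracy, with a more pronounced advantage on the more subjective datasets.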
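To make the role of the Gaussian sentiment term concrete, the following is a minimal sketch of one cluster-assignment step in a Dirichlet process mixture, assuming each text has already been reduced to a single sentiment score. It is not the paper's SDE algorithm: the word-likelihood term is omitted, the observation variance is assumed known, and the function name and all hyperparameters (`alpha`, `mu0`, `tau0_sq`, `sigma_sq`) are hypothetical choices made for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def assignment_probs(x, clusters, alpha=1.0, mu0=0.0, tau0_sq=1.0, sigma_sq=0.25):
    """Probability of assigning sentiment score x to each cluster.

    clusters: list of (count, sum_of_sentiment_scores) for existing clusters.
    Returns one probability per existing cluster, plus a final entry for
    opening a new cluster. All hyperparameters are illustrative assumptions.
    """
    weights = []
    for n, s in clusters:
        # Posterior over the cluster mean (Normal prior, known variance).
        post_var = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
        post_mean = post_var * (mu0 / tau0_sq + s / sigma_sq)
        # Chinese-restaurant-process weight times posterior predictive density.
        weights.append(n * gaussian_pdf(x, post_mean, post_var + sigma_sq))
    # A new cluster is weighted by alpha and the prior predictive density.
    weights.append(alpha * gaussian_pdf(x, mu0, tau0_sq + sigma_sq))
    total = sum(weights)
    return [w / total for w in weights]


# Two existing clusters: one clearly positive, one clearly negative.
probs = assignment_probs(0.9, [(5, 4.0), (5, -4.0)])
```

Because the cluster means are integrated out analytically, a sampler built on this step only needs per-cluster counts and sums, which is what makes a collapsed Gibbs scheme practical for text streams. The final entry of the returned list is the mechanism by which the number of clusters can grow without being fixed in advance.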
With the acceleration of global market integration, cross-border e-commerce live streaming has emerged as a new form of international trade, yet scholarly research on its cross-cultural communication remains limited. This study examines live streaming practices on Amazon and Alibaba International Station to analyze the cross-cultural characteristics of live streaming texts. Effective communication in this context requires anchors to possess solid cultural knowledge, adaptable communicative skills, and an open, inclusive mindset. Drawing on these findings, the paper proposes targeted optimization strategies: strengthening cultural awareness training, localizing live streaming content, and refining both linguistic and non-verbal communication strategies. These measures aim to enable practitioners to better meet the demands of diverse cultural markets, enhance communication effectiveness, and ultimately strengthen competitiveness in the global marketplace.
The rapid growth of social networks has produced an unprecedented amount of user-generated data, which provides an excellent opportunity for text mining. Authorship analysis, an important part of text mining, attempts to learn about the author of a text through subtle variations in writing style across gender, age, and social groups. Such information has a variety of applications, including advertising and law enforcement. One of the most accessible sources of user-generated data is Twitter, which makes the majority of its user data freely available through its data access API. In this study we seek to identify the gender of Twitter users using Perceptron and Naive Bayes with selected 1- through 5-gram features from tweet text. Streaming versions of these algorithms were employed for gender prediction to handle the speed and volume of tweet traffic. Because informal text, such as tweets, cannot be easily evaluated using traditional dictionary methods, n-gram features were used in this study to represent streaming tweets. The large number of 1- through 5-grams means that only a subset of them can be used in gender classification; for this reason, informative n-gram features were chosen using multiple selection algorithms. In the best case, the Naive Bayes and Perceptron algorithms produced accuracy, balanced accuracy, and F-measure above 99%.
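As a rough illustration of learning from 1- through 5-gram features one tweet at a time, the sketch below pairs a word n-gram extractor with a plain online perceptron. It makes several assumptions not taken from the study: whitespace tokenization, no feature selection step, labels encoded as +1 and -1, and the hypothetical names `ngrams` and `StreamingPerceptron`. The toy stream is invented purely to exercise the code.

```python
from collections import defaultdict


def ngrams(text, n_max=5):
    """Return all word n-grams of length 1..n_max, shortest first."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


class StreamingPerceptron:
    """Binary perceptron updated one example at a time (labels +1 / -1)."""

    def __init__(self):
        self.weights = defaultdict(float)
        self.bias = 0.0

    def score(self, feats):
        return self.bias + sum(self.weights.get(f, 0.0) for f in feats)

    def predict(self, text):
        return 1 if self.score(ngrams(text)) > 0 else -1

    def learn_one(self, text, label):
        feats = ngrams(text)
        # Classic perceptron rule: update only on a mistake (or zero margin).
        if label * self.score(feats) <= 0:
            for f in feats:
                self.weights[f] += label
            self.bias += label


# Toy stream with invented labels, processed in arrival order.
stream = [("love this game", 1), ("hate this game", -1)] * 2
model = StreamingPerceptron()
for text, label in stream:
    model.learn_one(text, label)
```

Each example is seen once and then discarded, so memory grows only with the number of distinct n-grams. That growth is one practical reason a feature selection step, as described in the abstract, would matter at Twitter scale.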
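Hot topics in text streams are divided into local and global hot topics, the correlation between the two is analyzed, and Kolmogorov complexity is applied to hot-topic mining across multiple text streams. First, the concept of redundant information based on Kolmogorov complexity is defined, and it is shown that a necessary condition for a text stream to contain a local hot topic is that its redundant information exceeds a certain threshold. Second, based on conditional Kolmogorov complexity, a similarity measure called stream information distance (SID) is proposed to quantify the similarity between different text streams. Borrowing the idea of phylogenetic trees from computational biology, a heuristic algorithm based on hierarchical clustering is then proposed for mining global hot topics across multiple text streams. Experiments on synthetic and real datasets verify the algorithm's convergence, effectiveness, and scalability.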
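Kolmogorov complexity is not computable, so any practical use has to approximate it. One common stand-in is the output length of a general-purpose compressor. The sketch below uses `zlib` that way to compute a redundancy ratio and the normalized compression distance (NCD). Both are assumptions chosen for illustration: the abstract does not give the paper's definitions of redundant information or SID, and NCD is a substitute technique, not SID itself.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Compressed length in bytes, used as a proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def redundancy(data: bytes) -> float:
    """Fraction of the input that compression removes (input must be non-empty)."""
    return 1.0 - compressed_size(data) / len(data)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when x and y share structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


# Two invented "streams", each dominated by one repeated headline.
stream_a = b"markets rally as tech shares surge on earnings. " * 20
stream_c = b"heavy rain floods coastal towns overnight again. " * 20

same_topic = ncd(stream_a, stream_a)
different_topic = ncd(stream_a, stream_c)
```

A pairwise distance matrix built this way could be passed to any agglomerative clustering routine to obtain a tree over streams, which is the general shape of the phylogeny-inspired approach the abstract describes. Short inputs are dominated by compressor overhead, so the proxy is only meaningful for reasonably long windows of text.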
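Text conversation extraction sorts the messages in short-text streams, such as online chat logs, into multiple conversation queues according to the conversation each message belongs to, which supports the management and further mining of short-text information. Existing conversation extraction techniques mainly refine clustering methods based on text similarity, and they face challenges from the feature sparsity, irregularity, and dynamic nature of short-text streams. To address these challenges, this work studies unsupervised conversation extraction and proposes an extraction method based on the temporal features of the message stream and on contextual correlation. First, the life-cycle pattern of conversations in a message stream is studied, and a conversation boundary detection method based on message generation frequency is proposed. Second, the concept of contextual correlation between messages is introduced and computed with an instance-based machine learning method. Finally, combining message generation frequency and contextual correlation, an online conversation extraction algorithm built on the Single-Pass clustering model, SPFC (single-pass based on frequency and correlation), is designed. Experiments on real datasets show that SPFC improves the F1 measure by 30% over existing conversation extraction algorithms based on text similarity.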
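The general Single-Pass idea, combining a temporal cue with a content cue, can be sketched as follows. This is not SPFC: a fixed time gap stands in for the frequency-based boundary detection, Jaccard token overlap stands in for the learned contextual correlation, and the function names, thresholds, and sample messages are all hypothetical.

```python
def jaccard(a, b):
    """Token-set overlap between two messages, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def single_pass_sessions(messages, sim_threshold=0.2, max_gap=300.0):
    """Assign each (timestamp, text) message to a conversation in one pass.

    A message joins the most similar conversation that is still active
    (last message within max_gap seconds); otherwise it opens a new one.
    """
    sessions = []  # each entry: {"last_time": float, "texts": [str, ...]}
    for t, text in messages:
        best, best_sim = None, 0.0
        for s in sessions:
            if t - s["last_time"] > max_gap:
                continue  # conversation treated as closed
            sim = max(jaccard(text, prev) for prev in s["texts"])
            if sim > best_sim:
                best, best_sim = s, sim
        if best is not None and best_sim >= sim_threshold:
            best["texts"].append(text)
            best["last_time"] = t
        else:
            sessions.append({"last_time": t, "texts": [text]})
    return [s["texts"] for s in sessions]


# Invented chat log: two interleaved conversations, then a late message.
chat = [
    (0, "anyone up for lunch today"),
    (10, "did the build pass"),
    (20, "lunch today sounds good"),
    (30, "the build failed again"),
    (1000, "lunch today anyone"),
]
result = single_pass_sessions(chat)
```

The last message shares vocabulary with the lunch conversation but arrives after that conversation has gone quiet, so it opens a new one. This shows how a temporal cue can override surface similarity, which is the kind of interaction the abstract attributes to combining frequency and correlation.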