To address the underutilization of Chinese research materials in nonferrous metals,a method for constructing a domain of nonferrous metals knowledge graph(DNMKG)was established.Starting from a domain thesaurus,entitie...To address the underutilization of Chinese research materials in nonferrous metals,a method for constructing a domain of nonferrous metals knowledge graph(DNMKG)was established.Starting from a domain thesaurus,entities and relationships were mapped as resource description framework(RDF)triples to form the graph’s framework.Properties and related entities were extracted from open knowledge bases,enriching the graph.A large-scale,multi-source heterogeneous corpus of over 1×10^(9) words was compiled from recent literature to further expand DNMKG.Using the knowledge graph as prior knowledge,natural language processing techniques were applied to the corpus,generating word vectors.A novel entity evaluation algorithm was used to identify and extract real domain entities,which were added to DNMKG.A prototype system was developed to visualize the knowledge graph and support human−computer interaction.Results demonstrate that DNMKG can enhance knowledge discovery and improve research efficiency in the nonferrous metals field.展开更多
文本的语义表示是自然语言处理和机器学习领域的研究难点,针对目前文本表示中的语义缺失问题,基于LDA主题模型和Word2vec模型,提出一种新的文本语义增强方法Sem2vec(semantic to vector)模型。该模型利用LDA主题模型获得单词的主题分布...文本的语义表示是自然语言处理和机器学习领域的研究难点,针对目前文本表示中的语义缺失问题,基于LDA主题模型和Word2vec模型,提出一种新的文本语义增强方法Sem2vec(semantic to vector)模型。该模型利用LDA主题模型获得单词的主题分布,计算单词与其上下文词的主题相似度,作为主题语义信息融入到词向量中,代替one-hot向量输入至Sem2vec模型,在最大化对数似然目标函数约束下,训练Sem2vec模型的最优参数,最终输出增强的语义词向量表示,并进一步得到文本的语义增强表示。在不同数据集上的实验结果表明,相比其他经典模型,Sem2vec模型的语义词向量之间的语义相似度计算更为准确。另外,根据Sem2vec模型得到的文本语义向量,在多种文本分类算法上的分类结果,较其他经典模型可以提升0.58%~3.5%,同时也提升了时间性能。展开更多
针对服务器的故障检测精度较低导致服务器维护成本增加的问题,提出了一种利用大型语言模型实现服务器故障的检测方法。利用BERT(Bidirectional Encoder Representation of Transformer)大型语言模型对服务器运行状态文本进行语义分析,...针对服务器的故障检测精度较低导致服务器维护成本增加的问题,提出了一种利用大型语言模型实现服务器故障的检测方法。利用BERT(Bidirectional Encoder Representation of Transformer)大型语言模型对服务器运行状态文本进行语义分析,生成高维词向量,以充分捕捉文本中的语义信息。基于生成的词向量,计算各词向量的权重值和互信息值,筛选出对故障检测具有显著贡献的关键词向量,从而降低数据维度并提升特征提取的准确性。将筛选出的关键词向量作为输入,利用GG(Gaussian-Gamma)聚类算法进行聚类分析,通过迭代优化聚类中心和隶属度矩阵,将服务器运行状态划分为正常状态和故障状态,并进一步识别具体故障类型。实验结果表明,该方法在关键词向量提取和故障检测性能上均表现出色,能够有效提升服务器故障检测的精度和效率,为降低服务器维护成本提供了可靠的技术支持。展开更多
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solves the so-called curse ...One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solves the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, upwards of hundreds of thousands of terms typically. As such, much of the research and development in NLP in the last two decades has been in finding and optimizing solutions to this problem, to feature selection in NLP effectively. This paper looks at the development of these various techniques, leveraging a variety of statistical methods which rest on linguistic theories that were advanced in the middle of the last century, namely the distributional hypothesis which suggests that words that are found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants which are typically referred to as word embeddings. In this review of algoriths such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally beyond applicability to NLP.展开更多
文摘To address the underutilization of Chinese research materials in nonferrous metals,a method for constructing a domain of nonferrous metals knowledge graph(DNMKG)was established.Starting from a domain thesaurus,entities and relationships were mapped as resource description framework(RDF)triples to form the graph’s framework.Properties and related entities were extracted from open knowledge bases,enriching the graph.A large-scale,multi-source heterogeneous corpus of over 1×10^(9) words was compiled from recent literature to further expand DNMKG.Using the knowledge graph as prior knowledge,natural language processing techniques were applied to the corpus,generating word vectors.A novel entity evaluation algorithm was used to identify and extract real domain entities,which were added to DNMKG.A prototype system was developed to visualize the knowledge graph and support human−computer interaction.Results demonstrate that DNMKG can enhance knowledge discovery and improve research efficiency in the nonferrous metals field.
文摘文本的语义表示是自然语言处理和机器学习领域的研究难点,针对目前文本表示中的语义缺失问题,基于LDA主题模型和Word2vec模型,提出一种新的文本语义增强方法Sem2vec(semantic to vector)模型。该模型利用LDA主题模型获得单词的主题分布,计算单词与其上下文词的主题相似度,作为主题语义信息融入到词向量中,代替one-hot向量输入至Sem2vec模型,在最大化对数似然目标函数约束下,训练Sem2vec模型的最优参数,最终输出增强的语义词向量表示,并进一步得到文本的语义增强表示。在不同数据集上的实验结果表明,相比其他经典模型,Sem2vec模型的语义词向量之间的语义相似度计算更为准确。另外,根据Sem2vec模型得到的文本语义向量,在多种文本分类算法上的分类结果,较其他经典模型可以提升0.58%~3.5%,同时也提升了时间性能。
文摘针对服务器的故障检测精度较低导致服务器维护成本增加的问题,提出了一种利用大型语言模型实现服务器故障的检测方法。利用BERT(Bidirectional Encoder Representation of Transformer)大型语言模型对服务器运行状态文本进行语义分析,生成高维词向量,以充分捕捉文本中的语义信息。基于生成的词向量,计算各词向量的权重值和互信息值,筛选出对故障检测具有显著贡献的关键词向量,从而降低数据维度并提升特征提取的准确性。将筛选出的关键词向量作为输入,利用GG(Gaussian-Gamma)聚类算法进行聚类分析,通过迭代优化聚类中心和隶属度矩阵,将服务器运行状态划分为正常状态和故障状态,并进一步识别具体故障类型。实验结果表明,该方法在关键词向量提取和故障检测性能上均表现出色,能够有效提升服务器故障检测的精度和效率,为降低服务器维护成本提供了可靠的技术支持。
文摘One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solves the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, upwards of hundreds of thousands of terms typically. As such, much of the research and development in NLP in the last two decades has been in finding and optimizing solutions to this problem, to feature selection in NLP effectively. This paper looks at the development of these various techniques, leveraging a variety of statistical methods which rest on linguistic theories that were advanced in the middle of the last century, namely the distributional hypothesis which suggests that words that are found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants which are typically referred to as word embeddings. In this review of algoriths such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally beyond applicability to NLP.