Semantic representation of text is a challenging research problem in natural language processing and machine learning. To address the loss of semantic information in current text representations, a new text-semantic-enhancement method, the Sem2vec (semantic to vector) model, is proposed on the basis of the LDA topic model and the Word2vec model. The model uses LDA to obtain each word's topic distribution and computes the topic similarity between a word and its context words; this topic-semantic information is fused into the word vector, which replaces the one-hot vector as the input to the Sem2vec model. Under the constraint of maximizing a log-likelihood objective function, the optimal parameters of the Sem2vec model are trained, and the model outputs semantically enhanced word-vector representations, from which a semantically enhanced representation of the whole text is further obtained. Experimental results on different datasets show that, compared with other classical models, semantic similarity computed between Sem2vec word vectors is more accurate. In addition, the text semantic vectors produced by Sem2vec improve classification accuracy by 0.58%-3.5% over other classical models across several text-classification algorithms, while also improving runtime performance.
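The core fusion step described above, computing the topic similarity between a word and its context words from their LDA topic distributions, can be sketched as follows. This is a minimal illustration: the function names and the toy three-topic distributions are assumptions, not taken from the Sem2vec implementation.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def topic_context_similarity(word_topics, context_topics):
    """Average topic similarity between a word and its context words,
    used here as the topic-semantic signal fused into the word vector."""
    sims = [cosine(word_topics, ct) for ct in context_topics]
    return sum(sims) / len(sims)

# Toy 3-topic LDA distributions for a target word and two context words.
word = [0.7, 0.2, 0.1]
context = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
sim = topic_context_similarity(word, context)
```

In the paper's pipeline this similarity-weighted signal would replace the one-hot input; here it is only computed, to show the shape of the quantity involved.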
[Objective] In industrial control systems (ICS), communication between devices relies heavily on industrial control protocols, and protocol security plays a key role in the stable operation of ICS. Vulnerability discovery and intrusion detection, as core technical components of the ICS security defense system, depend on accurate parsing of protocol structure and semantics. Protocol reverse engineering is the key technique for such parsing, and the precision of its core step, semantic inference, directly determines how accurately a protocol is understood. However, because industrial protocol documentation is often unavailable and formats are highly heterogeneous, existing semantic-inference methods generally rely on expert experience and suffer from inherent bottlenecks such as limited automation and poor cross-protocol generalization, making them inadequate for high-precision parsing of multi-source heterogeneous protocols in real industrial environments. [Methods] To address these problems, this paper proposes a semantic-inference method that combines mBERT with multi-source domain adaptation and a structured masking strategy. The mBERT model provides a cross-protocol general semantic representation. A structured masking strategy designed around attention weights and positional encoding strengthens the model's representation of the intrinsic links between protocol structure and semantics, improving the automation and efficiency of semantic inference. A multi-source domain-adaptive step-wise fine-tuning strategy combined with adversarial training improves the model's general semantic representation across multiple source protocols, enhancing its applicability to a variety of industrial protocols and enabling effective inference of keyword semantics. [Results] Experiments were conducted on the attack-defense drill range of a typical energy enterprise at the Liaoning Provincial Key Laboratory of Information Security for the Petrochemical Industry. Traffic for three industrial protocols, S7comm, Modbus/TCP, and EtherNet/IP, was collected, and a training dataset was assembled using a protocol-complexity scoring mechanism. The results show that the multi-source domain-adaptive step-wise fine-tuning strategy significantly improves model performance; combining it with the structured masking strategy further improves semantic-inference precision, and the proposed method significantly outperforms existing baselines on precision, recall, and F1 score. [Conclusions] The proposed method additionally employs high-dimensional spherical mapping and a multi-task loss function during semantic inference, strengthening the separation between semantic classes and the model's ability to discriminate deep protocol semantics. It substantially reduces dependence on manual prior knowledge, improves semantic-inference efficiency and cross-protocol applicability, and offers a theoretically grounded path for industrial protocol reverse engineering and industrial system security protection.
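The abstract does not give the exact scoring rule of the structured masking strategy, but the general idea, using attention weights over token positions to decide which protocol fields to mask during pre-training, can be sketched as follows. The token names, attention weights, and top-k selection rule below are all invented for illustration.

```python
def choose_mask_positions(attention, k, special=(0,)):
    """Rank token positions by attention weight and pick the top-k to mask,
    skipping special positions (e.g. the [CLS] slot).  A stand-in for the
    paper's structured masking, whose exact scoring is not specified."""
    candidates = [i for i in range(len(attention)) if i not in special]
    candidates.sort(key=lambda i: attention[i], reverse=True)
    return sorted(candidates[:k])

def apply_mask(tokens, positions, mask_token="[MASK]"):
    """Replace the chosen positions with the mask token."""
    chosen = set(positions)
    return [mask_token if i in chosen else t for i, t in enumerate(tokens)]

# Hypothetical tokenized protocol frame and per-token attention weights.
tokens = ["[CLS]", "func", "code", "addr", "len", "payload"]
attn   = [0.30, 0.25, 0.05, 0.20, 0.10, 0.10]
pos = choose_mask_positions(attn, k=2)
masked = apply_mask(tokens, pos)
```

The intuition is that masking structurally important fields (high-attention positions) forces the model to learn the dependencies between protocol structure and field semantics, rather than memorizing easy fillers.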
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solve the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. As such, much of the research and development in NLP in the last two decades has been devoted to finding and optimizing solutions to this problem, in effect to feature selection in NLP. This paper looks at the development of these various techniques, leveraging a variety of statistical methods that rest on linguistic theories advanced in the middle of the last century, namely the distributional hypothesis, which suggests that words found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as a data-structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants, typically referred to as word embeddings. In this review of algorithms such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally, beyond their applicability to NLP.
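The distributional hypothesis referenced above can be demonstrated with the simplest count-based model: words that occur in similar contexts end up with similar co-occurrence vectors. The toy corpus and window size below are invented for illustration; real systems apply weighting (e.g. TF-IDF or PMI) and dimensionality reduction on top of such counts.

```python
import math

def cooc_vectors(sentences, window=2):
    """Build word -> context co-occurrence count vectors from a tiny
    tokenized corpus, using a symmetric context window."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = {w: [0.0] * len(vocab) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    vecs[w][index[s[j]]] += 1.0
    return vecs

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) *
                  math.sqrt(sum(b * b for b in q)))

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "mat"],
    ["stocks", "rose", "on", "strong", "earnings"],
]
vecs = cooc_vectors(corpus)
```

Here "cat" and "dog" share identical contexts and so receive identical vectors, while "stocks" lands far away; Word2Vec, GloVe, and their successors learn dense versions of this same geometry.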
To address the underutilization of Chinese research materials in nonferrous metals, a method for constructing a domain of nonferrous metals knowledge graph (DNMKG) was established. Starting from a domain thesaurus, entities and relationships were mapped as resource description framework (RDF) triples to form the graph's framework. Properties and related entities were extracted from open knowledge bases, enriching the graph. A large-scale, multi-source heterogeneous corpus of over 1×10^9 words was compiled from recent literature to further expand DNMKG. Using the knowledge graph as prior knowledge, natural language processing techniques were applied to the corpus, generating word vectors. A novel entity evaluation algorithm was used to identify and extract real domain entities, which were added to DNMKG. A prototype system was developed to visualize the knowledge graph and support human-computer interaction. Results demonstrate that DNMKG can enhance knowledge discovery and improve research efficiency in the nonferrous metals field.
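The RDF-triple backbone described above can be sketched with plain (subject, predicate, object) tuples and simple queries over them. The entity and relation names below are invented for illustration and are not drawn from DNMKG; a production graph would use an RDF store and proper IRIs.

```python
# Hypothetical triples in the shape of the graph's RDF framework.
triples = [
    ("copper", "isA", "nonferrous_metal"),
    ("copper", "hasProperty", "high_conductivity"),
    ("aluminium", "isA", "nonferrous_metal"),
]

def objects(triples, subject, predicate):
    """All objects linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def neighbors(triples, entity):
    """All entities directly linked to `entity`, in either direction."""
    out = {o for s, p, o in triples if s == entity}
    out |= {s for s, p, o in triples if o == entity}
    return out
```

Newly extracted domain entities, once validated by the entity evaluation algorithm, would simply be appended as further triples, which is what makes the RDF framing convenient for incremental graph expansion.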
To address the increased maintenance costs caused by low server fault-detection accuracy, a server fault-detection method based on a large language model is proposed. The BERT (Bidirectional Encoder Representations from Transformers) model is used for semantic analysis of server operating-status text, generating high-dimensional word vectors that fully capture the semantic information in the text. Based on these vectors, weight values and mutual-information values are computed for each word vector to select the keyword vectors that contribute most to fault detection, reducing data dimensionality and improving feature-extraction accuracy. The selected keyword vectors are then clustered with a GG (Gaussian-Gamma) clustering algorithm, which iteratively optimizes the cluster centers and membership matrix to divide server operating status into normal and fault states and further identify specific fault types. Experimental results show that the method performs well in both keyword-vector extraction and fault detection, effectively improving the accuracy and efficiency of server fault detection and providing reliable technical support for reducing server maintenance costs.
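The keyword-selection step above scores terms by their mutual information with the fault label. A minimal sketch of that scoring, using term presence/absence against a binary normal/fault label, is shown below; the log snippets are invented, and this plain MI formula stands in for the paper's combined weight-plus-MI criterion.

```python
import math
from collections import Counter

def mutual_information(docs, labels, term):
    """MI between a term's presence and the status label over a labeled
    log set; a sketch of the keyword filtering, not the paper's exact rule."""
    n = len(docs)
    present = [term in d.split() for d in docs]
    joint = Counter(zip(present, labels))
    p_t = Counter(present)
    p_y = Counter(labels)
    mi = 0.0
    for (t, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((p_t[t] / n) * (p_y[y] / n)))
    return mi

# Toy operating-status snippets with fault/normal labels (invented).
docs = ["disk error timeout", "disk ok", "error cpu", "cpu ok"]
labels = ["fault", "normal", "fault", "normal"]
```

In this toy set, "error" co-occurs perfectly with the fault label and gets the maximum score of 1 bit, while "cpu" is independent of the label and scores 0, so only "error"-like terms would survive the filter.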