Funding: The Communication University of China (CUC230A013); the Fundamental Research Funds for the Central Universities.
Abstract: The advent of self-attention mechanisms within Transformer models has significantly propelled the advancement of deep learning algorithms, yielding outstanding achievements across diverse domains. Nonetheless, self-attention mechanisms falter when applied to datasets with intricate semantic content and extensive dependency structures. In response, this paper introduces a Diffusion Sampling and Label-Driven Co-attention Neural Network (DSLD), which adopts a diffusion sampling method to capture more comprehensive semantic information from the data. Additionally, the model incorporates the joint correlation information of labels and data into the computation of the text representation, correcting semantic representation biases in the data and increasing the accuracy of the semantic representation. Ultimately, the model computes the classification results by synthesizing these rich semantic representations. Experiments on seven benchmark datasets show that the proposed model achieves competitive results compared to state-of-the-art methods.
Funding: Supported in part by the National Natural Science Foundation of China (No. 62221005), the National Key Research and Development Program of China (No. 2021YFF0704101, No. 2020YFC2003502), the National Natural Science Foundation of China (No. 61876201), the Natural Science Foundation of Chongqing (No. cstc2019jcyj-cxtt X0002, No. cstc2021ycjh-bgzxm0013), and the Key Cooperation Project of the Chongqing Municipal Education Commission (HZ2021008).
Abstract: Causality extraction has become a crucial task in natural language processing and knowledge graphs. However, most existing methods divide causality extraction into two subtasks: extraction of candidate causal pairs and classification of causality. These methods result in cascading errors and the loss of associated contextual information. Therefore, in this study, an End-to-end Multi-Granulation Causality Extraction model (EMGCE), based on graph theory, is proposed to extract explicit causality and directly mine implicit causality. First, the sentences are represented on different granulation layers, comprising character, word, and contextual string layers. The word layer is further divided into three layers: word-index, word-embedding, and word-position-embedding layers. Then, a granular causality tree of the dataset is built based on the word-index layer. Next, an improved tagREtriplet algorithm is designed to obtain the labeled causality from the granular causality tree, which transforms the task into a sequence labeling task. Subsequently, the multi-granulation semantic representation is fed into the neural network model to extract causality. Finally, experimental results on the extended public SemEval 2010 Task 8 dataset demonstrate that EMGCE is effective.
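To make the phrase "sequence labeling task" concrete, here is a minimal sketch of how one annotated cause/effect pair could be turned into per-token tags. The BIO-style tag set (`B-C`, `I-C`, `B-E`, `I-E`, `O`), the function name `tag_causal_sentence`, and the example sentence are assumptions made for illustration; the abstract does not specify the tag scheme used by the tagREtriplet algorithm, and this is not that algorithm.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def tag_causal_sentence(tokens: List[str], cause: Span, effect: Span) -> List[str]:
    """Label each token as part of the cause (C), the effect (E), or neither (O)."""
    tags = ["O"] * len(tokens)
    for (start, end), label in ((cause, "C"), (effect, "E")):
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span {(start, end)} is outside the sentence")
        for i in range(start, end):
            if tags[i] != "O":
                raise ValueError("cause and effect spans overlap")
            # The first token of a span gets B-, the remaining tokens get I-.
            tags[i] = ("B-" if i == start else "I-") + label
    return tags


# Hypothetical example sentence; spans are given as token index ranges.
tokens = "The heavy rain caused severe flooding".split()
tags = tag_causal_sentence(tokens, cause=(1, 3), effect=(4, 6))
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Once every sentence is encoded this way, a single tagger can predict cause and effect spans jointly, which is the usual motivation for avoiding a separate candidate-pair stage.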
Abstract: Expert Recommendation (ER) aims to identify domain experts with high expertise and willingness to provide answers to questions in Community Question Answering (CQA) web services. How to model questions and users in the heterogeneous content network is critical to this task. Most traditional methods focus on modeling questions and users based on the textual content left in the community, while ignoring the structural properties of heterogeneous CQA networks and suffering from textual data sparsity. Recent approaches take advantage of structural proximities between nodes and attempt to fuse the textual content of nodes for modeling. However, they often fail to distinguish the nodes’ personalized preferences and consider the textual content of only a part of the nodes in network embedding learning, while ignoring the semantic relevance of nodes. In this paper, we propose a novel framework that jointly considers structural proximity relations and textual semantic relevance to model users and questions more comprehensively. Specifically, we learn topology-based embeddings through a hierarchical attentive network learning strategy, in which the proximity information and the personalized preferences of nodes are encoded and preserved. Meanwhile, we use each node’s textual content and the text correlation between adjacent nodes to build the content-based embedding through a meta-context-aware skip-gram model. In addition, the user’s relative answer quality is incorporated to improve ranking performance. Experimental results show that, by combining deep semantic understanding with structural feature learning, the proposed framework consistently and significantly outperforms state-of-the-art baselines on three real-world datasets. Performance is analyzed in terms of MRR, P@K, and MAP, and the proposed framework outperforms the existing methods on these metrics.
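For readers unfamiliar with the three ranking metrics named above, the following sketch computes them from their standard definitions. The user identifiers and the two example rankings are invented for illustration and are not data from the paper.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """P@K: fraction of the top-k ranked candidates that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant candidate, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical rankings of candidate experts for two questions.
queries = [
    (["u3", "u1", "u7", "u2"], {"u1", "u2"}),
    (["u5", "u4", "u9"], {"u5"}),
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
map_score = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
p_at_2 = sum(precision_at_k(r, rel, 2) for r, rel in queries) / len(queries)
print(f"MRR={mrr:.3f} MAP={map_score:.3f} P@2={p_at_2:.3f}")
```

MRR rewards placing one good expert early, MAP rewards ranking all good experts early, and P@K measures only the quality of the top of the list.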
Funding: This work is supported in part by the Information Security Software Project (2020) of the Ministry of Industry and Information Technology, PR China, under Grant CEIEC-2020-ZM02-0134.
Abstract: The byte stream is widely used in malware detection because it does not depend on reverse engineering. However, existing byte-stream methods apply an indiscriminate feature extraction strategy, which ignores the functional differences between bytes in different segments and fails to achieve targeted feature extraction for the various byte semantic representation modes, resulting in byte semantic confusion. To address this issue, this paper proposes an enhanced adversarial byte-function-associated method for malware backdoor attacks, which categorizes bytes into three functions: structure, code, and data. The MinHash algorithm, grayscale mapping, and state transition probability statistics are then used to capture byte semantics from the perspectives of text signature, spatial structure, and statistics, respectively, to increase the accuracy of the byte semantic representation. Finally, a three-channel malware feature image is constructed from the byte semantics of the different functions, and a convolutional neural network is applied for detection. Experiments on multiple datasets from 2018 to 2021 show that the method can effectively combine byte functions to achieve targeted feature extraction, avoid byte semantic confusion, and improve the accuracy of malware detection.
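As background on the MinHash step, the sketch below builds a MinHash signature over byte n-grams and uses it to estimate the Jaccard similarity of two byte sequences. The shingle length, the number of hash functions, the use of salted BLAKE2b as the hash family, and all function names are assumptions for illustration; the abstract does not say how the paper configures MinHash or what it hashes.

```python
import hashlib
from typing import List, Set


def byte_shingles(data: bytes, n: int = 4) -> Set[bytes]:
    """Return the set of all length-n byte substrings (shingles) of data."""
    if len(data) < n:
        return {data} if data else set()
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def minhash_signature(shingles: Set[bytes], num_hashes: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    if not shingles:
        raise ValueError("cannot build a signature from an empty shingle set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in shingles
        ))
    return signature


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


# Hypothetical byte sequences standing in for segments of two binaries.
sample_a = bytes(range(64))
sample_b = bytes(range(32, 96))    # overlaps the second half of sample_a
sample_c = bytes(range(128, 192))  # shares no bytes with sample_a
sig_a, sig_b, sig_c = (
    minhash_signature(byte_shingles(s)) for s in (sample_a, sample_b, sample_c)
)
print("a vs b:", estimate_jaccard(sig_a, sig_b))
print("a vs c:", estimate_jaccard(sig_a, sig_c))
```

The signature has a fixed length regardless of input size, which is what makes it usable as a compact, comparable feature.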
Funding: Supported by the Major Program of the National Natural Science Foundation of China [42090015].
Abstract: Satellite remote sensing, characterized by extensive coverage, frequent revisits, and continuous monitoring, provides essential data support for addressing global challenges. Over the past six decades, thousands of Earth observation satellites and sensors have been deployed worldwide. These valuable Earth observation assets are contributed independently by various nations and organizations employing diverse methodologies. This poses a significant challenge in effectively discovering global Earth observation resources and realizing their full potential. In this paper, we describe the development of GEOSatDB, the most complete semantic database of civil Earth observation satellites, developed based on a unified ontology model. A similarity matching method is used to integrate satellite information, and a prompt strategy is used to extract unstructured sensor information. The resulting semantic database contains 127,949 semantic statements for 2,340 remote sensing satellites and 1,021 observation sensors. The global Earth observation capabilities of 195 countries worldwide have been analyzed in detail, and a concrete use case along with an associated query demonstration is presented. This database provides significant value in effectively facilitating the semantic understanding and sharing of Earth observation resources.
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2018YFB0203801 and the National Natural Science Foundation of China under Grant Nos. 61702529 and 61802424.
Abstract: Massive ocean data acquired by various observing platforms and sensors poses new challenges to data management and utilization. Typically, it is difficult to find the desired data efficiently and effectively among a large number of datasets. Most existing methods for data discovery are based on keyword retrieval or direct semantic reasoning, and they are either limited in data access rate or do not take the time cost into account. In this paper, we design and implement a novel system that alleviates the problem by introducing semantics with ontologies, referred to as Data Ontology and List-Based Publishing (DOLP). Specifically, we improve ocean data services in the following three aspects. First, we propose a unified semantic model called OEDO (Ocean Environmental Data Ontology) to represent heterogeneous ocean data through metadata and publish it as data services. Second, we propose an optimized quick service query list (QSQL) data structure for storing the pre-inferred semantically related services and reducing the service querying time. Third, we propose two algorithms for optimizing QSQL hierarchically and horizontally, respectively, which aim to extend the semantic relationships of the data services and improve the data access rate. Experimental results show that DOLP outperforms the benchmark methods. First, our QSQL-based data discovery methods obtain a higher recall rate than the keyword-based method and are faster than the traditional semantic method based on direct reasoning. Second, DOLP can handle more complex semantic relationships than the existing methods.
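The idea of storing pre-inferred related services so that queries avoid reasoning at request time can be sketched as a precomputed lookup table. Everything below is a hypothetical illustration: the tiny concept hierarchy, the service names, and the function `build_query_list` are invented, and the abstract does not describe the actual QSQL layout or its hierarchical and horizontal optimization algorithms.

```python
from collections import deque
from typing import Dict, List

# Hypothetical ontology fragment: concept -> directly narrower concepts.
NARROWER: Dict[str, List[str]] = {
    "OceanData": ["Temperature", "Salinity"],
    "Temperature": ["SeaSurfaceTemperature"],
    "Salinity": [],
    "SeaSurfaceTemperature": [],
}

# Hypothetical registry: concept -> data services annotated with that concept.
SERVICES: Dict[str, List[str]] = {
    "Temperature": ["svc-argo-temp"],
    "SeaSurfaceTemperature": ["svc-modis-sst"],
    "Salinity": ["svc-argo-sal"],
}


def build_query_list(
    narrower: Dict[str, List[str]], services: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Pre-infer, for every concept, the services of it and all its descendants."""
    query_list: Dict[str, List[str]] = {}
    for concept in narrower:
        seen = {concept}
        queue = deque([concept])
        found: List[str] = []
        while queue:  # breadth-first walk down the hierarchy
            current = queue.popleft()
            found.extend(services.get(current, []))
            for child in narrower.get(current, []):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        query_list[concept] = sorted(set(found))
    return query_list


# Inference runs once, offline; each query is then a dictionary lookup.
query_list = build_query_list(NARROWER, SERVICES)
print(query_list["OceanData"])
print(query_list["Temperature"])
```

The trade-off is the usual one for precomputation: extra storage and a rebuild whenever the ontology or the service registry changes, in exchange for constant-time lookups that still return semantically related services a keyword match would miss.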
Abstract: Binary Code Similarity Detection (BCSD) is vital for vulnerability discovery, malware detection, and software security, especially when source code is unavailable. Yet it faces challenges from semantic loss, recompilation variations, and obfuscation. Recent advances in artificial intelligence, particularly natural language processing (NLP), graph representation learning (GRL), and large language models (LLMs), have markedly improved accuracy, enabling better recognition of code variants and deeper semantic understanding. This paper presents a comprehensive review of 82 studies published between 1975 and 2025, systematically tracing the historical evolution of BCSD and analyzing the progressive incorporation of artificial intelligence (AI) techniques. Particular emphasis is placed on the role of LLMs, which have recently emerged as transformative tools in advancing semantic representation and enhancing detection performance. The review is organized around five central research questions: (1) the chronological development and milestones of BCSD; (2) the construction of AI-driven technical roadmaps that chart methodological transitions; (3) the design and implementation of general analytical workflows for binary code analysis; (4) the applicability, strengths, and limitations of LLMs in capturing semantic and structural features of binary code; and (5) the persistent challenges and promising directions for future investigation. By synthesizing insights across these dimensions, the study demonstrates how LLMs reshape the landscape of binary code analysis, offering unprecedented opportunities to improve accuracy, scalability, and adaptability in real-world scenarios. This review not only bridges a critical gap in the existing literature but also provides a forward-looking perspective, serving as a valuable reference for researchers and practitioners aiming to advance AI-powered BCSD methodologies and applications.
Abstract: Accurate classification of cassava disease, particularly in field scenarios, relies on object semantic localization to identify and precisely locate specific objects within an image based on their semantic meaning, thereby enabling targeted classification while suppressing irrelevant noise and focusing on key semantic features. The advancement of deep convolutional neural networks (CNNs) paved the way for identifying cassava diseases by leveraging salient semantic features and promising high returns. This study proposes an approach that incorporates three innovative elements to refine feature representation for cassava disease classification. First, a mutual-attention method is introduced to highlight semantic features and suppress irrelevant background features in the feature maps. Second, instance batch normalization (IBN) is employed after the residual unit to construct salient semantic features using the mutual-attention method, representing high-quality semantic features in the foreground. Finally, the RSigELUD activation method replaces the conventional ReLU activation, enhancing the nonlinear mapping capacity of the proposed neural network and further improving fine-grained leaf disease classification performance. This approach significantly aids in distinguishing subtle disease manifestations in cassava leaves. The proposed neural network, MAIRNet-101 (Mutual-attention IBN RSigELUD Neural Network), achieved an accuracy of 95.30% and an F1-score of 0.9531, outperforming EfficientNet-B5 and RepVGG-B3g4. To evaluate the generalization capability of MAIRNet, the FGVC-Aircraft dataset was used to train MAIRNet-50, which achieved an accuracy of 83.64%. These results suggest that the proposed algorithm is well suited for cassava leaf disease classification applications and offers a robust solution for advancing agricultural technology.
Funding: This work was supported by the National Natural Science Foundation of China under Grant Nos. 71473035 and 11501095, the Fundamental Research Funds for the Central Universities of China under Grant No. 2412017QD028, the China Postdoctoral Science Foundation under Grant No. 2017M021192, the Scientific and Technological Development Program of Jilin Province of China under Grant Nos. 20180520022JH, 20150204040GX, and 20170520051JH, the Jilin Province Development and Reform Commission Project of China under Grant Nos. 2015Y055 and 2015Y054, and the Natural Science Foundation of Jilin Province of China under Grant No. 20150101057JC.
Abstract: With the ever-growing dynamicity, complexity, and volume of information resources, the recommendation technique has been proposed and has become one of the most effective techniques for solving the so-called problem of information overload. Traditional recommendation algorithms, such as user-based or item-based collaborative filtering, measure the degree of similarity between users or items with only a single criterion, i.e., ratings. According to previous studies, a single criterion cannot accurately measure the similarity between user preferences or items. In recent years, the application of deep learning techniques has gained significant momentum in recommender systems for better understanding of user preferences, item characteristics, and historical interactions. In this work, we integrate plot information as auxiliary information into a denoising autoencoder (DAE). The resulting model, called SemRe-DCF, aims at learning semantic representations of item descriptions and succeeds in capturing fine-grained semantic regularities by using vector arithmetic to obtain better rating prediction. The results show that the proposed method can effectively improve prediction accuracy and solve the cold start problem.
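The limitation of rating-only similarity described above can be seen in a small sketch of the traditional baseline. The code computes cosine similarity between two users over their co-rated items; the user names, item identifiers, and ratings are invented for illustration, and this is the classical collaborative-filtering measure, not the SemRe-DCF model.

```python
import math
from typing import Dict

Ratings = Dict[str, float]  # item id -> rating


def rating_similarity(a: Ratings, b: Ratings) -> float:
    """Cosine similarity of two users, restricted to the items both have rated."""
    common = a.keys() & b.keys()
    if not common:
        return 0.0  # no co-rated items: ratings alone say nothing about the pair
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical users: alice and bob agree on two films; carol is a new user.
alice = {"m1": 5.0, "m2": 3.0, "m3": 4.0}
bob = {"m1": 5.0, "m2": 3.0, "m4": 1.0}
carol = {"m5": 2.0}

sim_alice_bob = rating_similarity(alice, bob)
sim_alice_carol = rating_similarity(alice, carol)
print(f"alice~bob={sim_alice_bob:.3f} alice~carol={sim_alice_carol:.3f}")
```

With no overlapping ratings, the similarity for the new user collapses to zero, which is the cold start situation; auxiliary signals such as item descriptions give a model something to compare even when ratings are absent.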