Abstract: This paper proposes machine learning techniques to discover knowledge in a dataset in the form of if-then rules, for the purpose of formulating queries to validate a Bayesian belief network model of the same data. Although domain expertise is often available, the query formulation task is tedious and laborious, and hence automation of query formulation is desirable. To automate the query formulation process, a machine learning algorithm is leveraged to discover knowledge in the form of if-then rules in the data from which the Bayesian belief network model under validation was also induced. The set of if-then rules is processed and filtered through domain expertise to identify a subset of “interesting” and “significant” rules. This subset is formulated into corresponding queries to be posed, for validation purposes, to the Bayesian belief network induced from the same dataset. The promise of the proposed methodology was assessed through an empirical study performed on a real-life dataset, the National Crime Victimization Survey, which has over 250 attributes and well over 200,000 data points. The study demonstrated that the proposed approach is feasible and partially automates the query formulation process for validation of a complex probabilistic model, which substantially reduces the need for human expert involvement and investment.
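The rule-to-query step described in this abstract can be illustrated with a minimal sketch. It assumes a rule is stored as an antecedent, a consequent, and a confidence value, and that validation means comparing the model's conditional probability with that confidence. The attribute names, the `Rule` class, and the tolerance are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    antecedent: dict   # IF part: attribute -> value (hypothetical attributes)
    consequent: tuple  # THEN part: (attribute, value)
    confidence: float  # rule confidence measured on the dataset


def rule_to_query(rule):
    """Render a rule as a conditional-probability query string."""
    target, value = rule.consequent
    evidence = ", ".join(f"{a}={v}" for a, v in sorted(rule.antecedent.items()))
    return f"P({target}={value} | {evidence})"


def agrees(rule, model_probability, tolerance=0.1):
    """True when the model's answer lies within `tolerance` of the rule confidence."""
    return abs(model_probability - rule.confidence) <= tolerance


# Illustrative rule only; not taken from the National Crime Victimization Survey.
rule = Rule({"location": "home", "time": "night"}, ("reported", "yes"), 0.72)
query = rule_to_query(rule)
```

The probability passed to `agrees` would come from whatever inference engine answers the query against the Bayesian belief network; that engine is outside the scope of this sketch.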
Funding: Supported by the National Social Science Foundation of China (No. 14CTQ032) and the National Natural Science Foundation of China (No. 61370170).
Abstract: Plagiarism source retrieval is the core task of plagiarism detection. It has become standard practice in plagiarism detection to use queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in current research. Each heuristic-based method has its own advantages, and none statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements to heuristic methods for source retrieval rely mainly on the experience of experts, which makes it difficult to put forward new heuristic methods that overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach that selects the best queries from the candidates. The approach to query generation for source retrieval is formulated as a ranking framework. Specifically, it aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method exploits learning to rank to generate queries from the candidates. To our knowledge, our work is the first to apply machine learning methods to the problem of query generation for source retrieval. To address the absence of training data for learning to rank, we also construct training samples for source retrieval. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. The experimental results show that the proposed machine-learning-based query generation method yields statistically significant improvements over the established baselines in source retrieval effectiveness.
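The idea of ranking candidate queries can be sketched as follows. The abstract does not name the learning-to-rank algorithm or the features used, so this example substitutes a simple pairwise perceptron and two made-up features per candidate query. All names and numbers are hypothetical.

```python
def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """Learn weights from (better_features, worse_features) training pairs."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin <= 0:  # pair is misranked: move w toward the difference
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def score(w, features):
    """Linear ranking score of one candidate query."""
    return sum(wi * fi for wi, fi in zip(w, features))


def best_query(w, candidates):
    """Pick the highest-scoring query from a dict of query -> features."""
    return max(candidates, key=lambda q: score(w, candidates[q]))


# Hypothetical features: (mean term weight, normalized query length).
training_pairs = [([0.9, 0.2], [0.3, 0.8]), ([0.7, 0.1], [0.2, 0.5])]
weights = train_pairwise(training_pairs, n_features=2)
candidates = {"q_a": [0.8, 0.3], "q_b": [0.4, 0.9]}
chosen = best_query(weights, candidates)
```

In a setting like the one the abstract describes, each training pair would presumably be derived from the retrieval performance of two candidate queries for the same document segment.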
Abstract: In the current biomedical data movement, numerous efforts have been made to convert and normalize large amounts of traditional structured and unstructured data (e.g., EHRs, reports) into semi-structured data (e.g., RDF, OWL). With the increasing amount of semi-structured data coming into the biomedical community, data integration and knowledge discovery across heterogeneous domains have become important research problems. At the application level, detecting related concepts among medical ontologies is an important goal of life science research. It is even more crucial to determine how different concepts are related within a single ontology or across multiple ontologies by analysing predicates in different knowledge bases. However, in today's information explosion, it is extremely difficult for biomedical researchers to find existing or potential predicates that link cross-domain concepts without any support from schema pattern analysis. Therefore, there is a need for a mechanism that performs predicate-oriented pattern analysis to partition heterogeneous ontologies into smaller, more closely related topics and generates queries to discover cross-domain knowledge from each topic. In this paper, we present such a model, which performs predicate-oriented pattern analysis based on the close relationships among predicates and generates a similarity matrix. Based on this similarity matrix, we apply an innovative unsupervised learning algorithm to partition large data sets into smaller, more closely related topics and generate meaningful queries to fully discover knowledge over a set of interlinked data sources. We have implemented a prototype system named BmQGen and evaluated the proposed model with a colorectal surgical cohort from the Mayo Clinic.
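A rough sketch of the similarity-matrix and partitioning steps follows. The abstract does not specify the similarity measure or the unsupervised algorithm, so this example assumes Jaccard similarity over the concept types each predicate connects, and uses threshold-based connected components as a stand-in for the clustering step. The predicate and concept names are hypothetical and this is not the BmQGen implementation.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(predicates):
    """Return sorted predicate names and their pairwise similarity matrix."""
    names = sorted(predicates)
    matrix = [[jaccard(predicates[p], predicates[q]) for q in names] for p in names]
    return names, matrix


def partition(names, matrix, threshold=0.5):
    """Group predicates linked by similarity >= threshold (connected components)."""
    unseen = set(range(len(names)))
    topics = []
    while unseen:
        first = min(unseen)
        unseen.discard(first)
        stack, group = [first], []
        while stack:
            i = stack.pop()
            group.append(names[i])
            linked = [j for j in unseen if matrix[i][j] >= threshold]
            unseen.difference_update(linked)
            stack.extend(linked)
        topics.append(sorted(group))
    return topics


# Hypothetical predicates mapped to the concept types they connect.
predicates = {
    "hasDiagnosis": {"Patient", "Disease"},
    "hasSymptom": {"Patient", "Disease", "Symptom"},
    "treatedWith": {"Disease", "Drug"},
    "manufacturedBy": {"Drug", "Company"},
}
names, matrix = similarity_matrix(predicates)
topics = partition(names, matrix)
```

Each resulting topic could then seed a query restricted to the predicates in that group, which is one plausible reading of the query generation step.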
Funding: This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (2020R1A2B5B01002145).
Abstract: The social internet of things (SIoT) is an emerging paradigm proposed to solve the problems of network service discovery, navigability, and service composition. The SIoT aims to socialize IoT devices and shape the interconnections between them into social interactions, like those among human beings. In the IoT, an object can offer multiple services, and different objects can offer the same services with different parameters and interest factors. The proliferation of offered services leads to difficulties during service customization and service filtering, a problem known as service explosion. Selecting a suitable service that fits the requirements of applications and objects is a challenging task. To address these issues, we propose an efficient automated query-based service search model for the SIoT based on the local network navigability concept. In the proposed model, objects can use information from their friends or friends of their friends while searching for the desired services, rather than exploring a global network. We employ a centrality metric that computes the degree of importance of each object in the social IoT, which helps in selecting neighboring objects with high centrality scores. The distributed nature of our navigation model results in high scalability and short navigation times. We verified the efficacy of our model on a real-world SIoT-related dataset. The experimental results confirm the validity of our model in terms of scalability and navigability: the desired objects that provide services are determined quickly via the shortest path, which in turn improves the service search process in the SIoT.
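The local search idea can be sketched as a hop-limited walk over the friendship graph that visits high-centrality friends first. The abstract does not say which centrality metric is used, so this example assumes degree centrality. The graph, the service names, and the two-hop limit (friends and friends of friends) are illustrative assumptions.

```python
def degree_centrality(graph):
    """Fraction of the other objects each object is directly connected to."""
    n = len(graph)
    return {node: (len(nbrs) / (n - 1) if n > 1 else 0.0)
            for node, nbrs in graph.items()}


def find_service(graph, services, start, wanted, max_hops=2):
    """Level-by-level search limited to max_hops; returns a path or None."""
    centrality = degree_centrality(graph)
    frontier = [[start]]
    visited = {start}
    for _ in range(max_hops + 1):
        next_frontier = []
        for path in frontier:
            node = path[-1]
            if wanted in services.get(node, ()):
                return path
            # Expand the most central friends first; ties broken by name.
            ranked = sorted(graph[node], key=lambda x: (-centrality[x], x))
            for nbr in ranked:
                if nbr not in visited:
                    visited.add(nbr)
                    next_frontier.append(path + [nbr])
        frontier = next_frontier
    return None


# Hypothetical friendship graph and offered services.
graph = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"],
         "d": ["b", "e"], "e": ["d"]}
services = {"d": {"temperature"}, "e": {"humidity"}}
path = find_service(graph, services, "a", "temperature")
missing = find_service(graph, services, "a", "humidity")
```

Because the search proceeds one hop at a time, the first path found is a shortest one within the hop limit; the `humidity` service sits three hops from `a`, so it is not found with the default limit.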
Funding: by contractors of the U.S. Government under Contract Nos. DE-AC05-00OR22725 and DE-AC07-05ID14517.
Abstract: The need to perform spatial queries and searches is commonly encountered within the field of computational physics. The development of applications ranging from scientific visualization to finite element analysis requires efficient methods of locating domain objects relative to general locations in space. Much of the time, it is possible to form and maintain spatial relationships between objects either explicitly or by using relative motion constraints as the application evolves in time. Occasionally, either due to unpredictable relative motion or the lack of state information, an application must perform a general search (or ordering) of geometric objects without any explicit spatial relationship information as a basis. If previous state information involving domain geometric objects is not available, it is typically an involved and time-consuming process to create object adjacency information or to order the objects in space. Further, as the number of objects and the spatial dimension of the problem domain increase, the time required to search increases greatly. This paper proposes an implementation of a spatial k-d tree (skD-tree) for use by various applications when a general domain search is required. The skD-tree proposed in this paper is a spatial access method in which successive tree levels are split along different dimensions. Objects are indexed by their centroid, and the minimum bounding box of the objects in a node is stored in the tree node. The paper focuses on a discussion of efficient and practical algorithms for multidimensional spatial data structures for fast spatial query processing. These functions include the construction of a skD-tree of geometric objects, intersection query, containment query, and nearest neighbor query operations.
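The structure described in this abstract can be sketched as follows, assuming objects are axis-aligned boxes given as `(lower_corner, upper_corner)` tuples. The median split on centroids, the leaf size, and all function names are assumptions made for illustration and not the paper's implementation. Only construction and the intersection query are shown.

```python
from dataclasses import dataclass
from typing import Optional


def centroid(box):
    lo, hi = box
    return tuple((a + b) / 2 for a, b in zip(lo, hi))


def union(boxes):
    """Minimum bounding box enclosing all given boxes."""
    los, his = zip(*boxes)
    return (tuple(map(min, zip(*los))), tuple(map(max, zip(*his))))


def overlaps(a, b):
    """True when boxes a and b intersect in every dimension."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


@dataclass
class Node:
    bbox: tuple                      # bounding box of every object below this node
    objects: list                    # filled only in leaf nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(boxes, depth=0, leaf_size=1):
    """Build a tree from a non-empty list of boxes, cycling the split dimension."""
    bbox = union(boxes)
    if len(boxes) <= leaf_size:
        return Node(bbox, list(boxes))
    axis = depth % len(boxes[0][0])
    ordered = sorted(boxes, key=lambda b: centroid(b)[axis])  # index by centroid
    mid = len(ordered) // 2
    return Node(bbox, [],
                build(ordered[:mid], depth + 1, leaf_size),
                build(ordered[mid:], depth + 1, leaf_size))


def intersect(node, query):
    """Return all stored boxes overlapping the query box."""
    if node is None or not overlaps(node.bbox, query):
        return []  # prune: nothing below this node can overlap
    if node.left is None:
        return [b for b in node.objects if overlaps(b, query)]
    return intersect(node.left, query) + intersect(node.right, query)


boxes = [((0, 0), (1, 1)), ((2, 2), (3, 3)), ((5, 5), (6, 6)), ((0, 4), (1, 5))]
tree = build(boxes)
hits = intersect(tree, ((0.5, 0.5), (2.5, 2.5)))
```

Storing the bounding box in each node is what allows whole subtrees to be skipped during a query even though objects are placed by centroid alone. Containment and nearest neighbor queries would reuse the same pruning test.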