This review presents a comprehensive and forward-looking analysis of how Large Language Models (LLMs) are transforming knowledge discovery in the rational design of advanced micro/nano electrocatalyst materials. Electrocatalysis is central to sustainable energy and environmental technologies, but traditional catalyst discovery is often hindered by high complexity, fragmented knowledge, and inefficiencies. LLMs, particularly those based on Transformer architectures, offer unprecedented capabilities in extracting, synthesizing, and generating scientific knowledge from vast unstructured textual corpora. This work provides the first structured synthesis of how LLMs have been leveraged across various electrocatalysis tasks, including automated information extraction from literature, text-based property prediction, hypothesis generation, synthesis planning, and knowledge graph construction. We comparatively analyze leading LLMs and domain-specific frameworks (e.g., CatBERTa, CataLM, CatGPT) in terms of methodology, application scope, performance metrics, and limitations. Through curated case studies across key electrocatalytic reactions (HER, OER, ORR, and CO₂RR), we highlight emerging trends such as the growing use of embedding-based prediction, retrieval-augmented generation, and fine-tuned scientific LLMs. The review also identifies persistent challenges, including data heterogeneity, hallucination risks, lack of standard benchmarks, and limited multimodal integration. Importantly, we articulate future research directions, such as the development of multimodal and physics-informed MatSci-LLMs, enhanced interpretability tools, and the integration of LLMs with self-driving laboratories for autonomous discovery. By consolidating fragmented advances and outlining a unified research roadmap, this review provides valuable guidance for both materials scientists and AI practitioners seeking to accelerate catalyst innovation through large language model technologies.
A new structure of ESKD (expert system based on the knowledge discovery system KD (D&K)) is presented for the first time on the basis of KD (D&K), a synthesized knowledge discovery system based on a double-base (database and knowledge base) cooperating mechanism. With its new features, ESKD may form a new research direction and offers good prospects for addressing the problem of knowledge richness in the knowledge base. The general structural frame of ESKD and some of its sub-systems are described, with emphasis on the dynamic knowledge base built on the double-base cooperating mechanism. According to the results of a demonstrative experiment, the structure of ESKD is effective and feasible.
Recent developments in database technology have seen a wide variety of data being stored in huge collections. This variety makes analyzing a generic database a strenuous task in knowledge discovery. One approach is to summarize large datasets in such a way that the resulting summary dataset is of manageable size. Histograms have received significant attention as summarization/representative objects for large databases, but they suffer from computational and space complexity. In this paper, we propose transforming the histogram object into a Piecewise Linear Regression (PLR) line object and suggest that PLR objects can be less computationally and storage intensive than histograms. In addition, to carry out cluster analysis, we propose a distance measure for computing the distance between PLR lines. A case study is presented based on real data from the online education system LMS. It demonstrates that PLR is a powerful representative of knowledge in very large databases.
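The histogram-to-PLR idea in the abstract above can be illustrated with a minimal sketch. It assumes equal-width segments, an ordinary least-squares fit per segment, and a mean-absolute-gap distance. The abstract specifies none of these, so the function names `histogram_to_plr` and `plr_distance` and the distance itself are hypothetical stand-ins, not the authors' definitions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0  # a single-bin segment becomes a flat line
    return slope, mean_y - slope * mean_x


def histogram_to_plr(counts, n_segments):
    """Summarize a histogram (list of bin counts) as n_segments fitted lines.

    Assumption: segments are equal-width runs of bins. Each segment is
    (start_bin, end_bin, slope, intercept) with end_bin exclusive.
    """
    n = len(counts)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and the number of bins")
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    segments = []
    for start, end in zip(bounds, bounds[1:]):
        xs = list(range(start, end))
        slope, intercept = fit_line(xs, counts[start:end])
        segments.append((start, end, slope, intercept))
    return segments


def evaluate(segments):
    """Reconstruct the approximate bin counts from the PLR segments."""
    return [slope * x + intercept
            for start, end, slope, intercept in segments
            for x in range(start, end)]


def plr_distance(a, b):
    """Hypothetical distance: mean absolute gap between two PLR curves."""
    ya, yb = evaluate(a), evaluate(b)
    if len(ya) != len(yb):
        raise ValueError("PLR objects must cover the same number of bins")
    return sum(abs(p - q) for p, q in zip(ya, yb)) / len(ya)


if __name__ == "__main__":
    rising_then_falling = histogram_to_plr([0, 1, 2, 3, 6, 5, 4, 3], n_segments=2)
    flat = histogram_to_plr([3, 3, 3, 3, 3, 3, 3, 3], n_segments=2)
    print(rising_then_falling)
    print(plr_distance(rising_then_falling, flat))
```

Each segment stores four numbers regardless of how many bins it covers, which is where the storage saving would come from. A distance of this kind could then feed any standard clustering routine.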
To improve the performance of the multiple classifier system, a new method of feature-decision level fusion is proposed based on knowledge discovery. In the new method, the base classifiers operate on different feature spaces and their types depend on different measures of between-class separability. The uncertainty measures corresponding to each output of each base classifier are induced from the established decision tables (DTs) in the form of mass functions in the Dempster-Shafer theory (DST). Furthermore, an effective fusion framework is built at the feature-decision level on the basis of a generalized rough set model and the DST. The experiment on the classification of hyperspectral remote sensing images shows that the proposed method improves classification performance compared with plurality voting (PV).
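The Dempster-Shafer step mentioned in the abstract above can be sketched with the standard Dempster rule of combination. This is only the textbook rule: the abstract's rough-set procedure for inducing mass functions from decision tables is not reproduced, and the class labels and mass values below are made-up illustrations.

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of class labels (a focal element)
    to its mass; the masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            intersection = a & b
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b  # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    # Normalize by the non-conflicting mass
    return {s: m / (1.0 - conflict) for s, m in combined.items()}


if __name__ == "__main__":
    water, forest = frozenset({"water"}), frozenset({"forest"})
    either = water | forest  # ignorance: the whole frame of discernment
    # Hypothetical outputs of two base classifiers for one pixel
    classifier_1 = {water: 0.6, either: 0.4}
    classifier_2 = {water: 0.5, forest: 0.3, either: 0.2}
    fused = combine(classifier_1, classifier_2)
    for focal, mass in fused.items():
        print(sorted(focal), round(mass, 4))
```

Unlike plurality voting, the fused result keeps an explicit mass on the undecided set, so a classifier that is unsure contributes less to the final decision.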
To discover fault diagnosis knowledge in the maintenance records of flexible manufacturing system (FMS) equipment, an algorithm (process) is presented, which consists of ① a preparatory phase, in which some items in the maintenance record are selected and decomposed into associated concepts and attributes, and ② a discovering and establishing process, in which possible relationships between the concepts and attributes are established and knowledge is formulated. The rich diagnosis knowledge in the maintenance records was captured by applying the method. An application of the method to the diagnosis system for FMS equipment showed that the approach is correct and effective.
A new algorithm for knowledge discovery based on statistical induction logic is proposed, and the validity of the method is verified by examples. The method is suitable for a wide range of knowledge discovery applications in the study of causal relations, uncertain knowledge acquisition, and principal factor analysis. The language field description of the state space makes the algorithm robust in adaptation and yields easily understandable results, which are isomotopy with natural language in the topological space.
The 1st International Conference on Data-driven Knowledge Discovery: When Data Science Meets Information Science took place at the National Science Library (NSL), Chinese Academy of Sciences (CAS) in Beijing from June 19 to June 22, 2016. The Conference was opened by NSL Director Xiangyang Huang, who placed the event within the goals of the Library and lauded the spirit of international collaboration in the area of data science and knowledge discovery. The whole event was an encouraging success, with over 370 registered participants and highly enlightening presentations. The Conference was organized by the Journal of Data and Information Science (JDIS) to bring the Journal to the attention of an international and local audience.
The present article outlines progress made in designing an intelligent information system for automatic management and knowledge discovery in large numeric and scientific databases, with a validating application to the CAST-NEONS environmental databases used for ocean modeling and prediction. We describe a discovery-learning process (Automatic Data Analysis System) which combines the features of two machine learning techniques to generate sets of production rules that efficiently describe the observational raw data contained in the database. Data clustering allows the system to classify the raw data into meaningful conceptual clusters, which the system learns by induction to build decision trees, from which the production rules are automatically deduced.
In the current biomedical data movement, numerous efforts have been made to convert and normalize large amounts of traditional structured and unstructured data (e.g., EHRs, reports) into semi-structured data (e.g., RDF, OWL). With the increasing amount of semi-structured data coming into the biomedical community, data integration and knowledge discovery from heterogeneous domains have become important research problems. At the application level, detecting related concepts among medical ontologies is an important goal of life science research. It is even more crucial to figure out how different concepts are related within a single ontology or across multiple ontologies by analysing predicates in different knowledge bases. However, in today's information explosion, it is extremely difficult for biomedical researchers to find existing or potential predicates for linking cross-domain concepts without any support from schema pattern analysis. Therefore, there is a need for a mechanism that performs predicate-oriented pattern analysis to partition heterogeneous ontologies into smaller, more closely related topics and generates queries to discover cross-domain knowledge from each topic. In this paper, we present such a model, which performs predicate-oriented pattern analysis based on how closely predicates are related and generates a similarity matrix. Based on this similarity matrix, we apply an innovative unsupervised learning algorithm to partition large data sets into smaller, more closely related topics and generate meaningful queries to fully discover knowledge over a set of interlinked data sources. We have implemented a prototype system named BmQGen and evaluated the proposed model with a colorectal surgical cohort from the Mayo Clinic.
Purpose: This paper explores a method of knowledge discovery by visualizing and analyzing co-occurrence relations among three or more entities in collections of journal articles. Design/methodology/approach: A variety of methods, such as model construction, system analysis, and experiments, are used. The author has improved Morris' crossmapping technique and developed a technique for directly describing, visualizing, and analyzing co-occurrence relations among three or more entities in collections of journal articles. Findings: The visualization tools and the knowledge discovery method can efficiently reveal the multiple co-occurrence relations among three entities in collections of journal papers. They reveal more, and more in-depth, information than analyzing co-occurrence relations between two entities. Therefore, this method can be used for mapping a knowledge domain that is manifested in association with the entities from multi-dimensional perspectives and in an all-round way. Research limitations: At present, the technique can only be used to analyze co-occurrence relations among no more than three entities. Practical implications: This research has expanded the scope of co-occurrence analysis. The research result provides theoretical support for co-occurrence analysis. Originality/value: There has not been a systematic study of co-occurrence relations among multiple entities in collections of journal articles. This research defines multiple co-occurrence and the research scope, develops the visualization analysis tool, and designs the analysis model of the knowledge discovery method.
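Counting co-occurrence among three entities, as in the abstract above, can be sketched as follows. The abstract does not say which entity types are used or how the improved crossmapping works, so the author/keyword/journal triple, the record layout, and the name `triple_cooccurrence` are illustrative assumptions only.

```python
from collections import Counter
from itertools import product


def triple_cooccurrence(records):
    """Count how often each (author, keyword, journal) triple appears together.

    Assumption: every record is a dict with a list of authors, a list of
    keywords, and a single journal name.
    """
    counts = Counter()
    for record in records:
        triples = product(record["authors"], record["keywords"], [record["journal"]])
        for triple in triples:
            counts[triple] += 1
    return counts


if __name__ == "__main__":
    # Toy records, invented for illustration
    records = [
        {"authors": ["Li", "Wang"], "keywords": ["co-occurrence"], "journal": "J1"},
        {"authors": ["Li"], "keywords": ["co-occurrence", "visualization"], "journal": "J1"},
        {"authors": ["Chen"], "keywords": ["visualization"], "journal": "J2"},
    ]
    for triple, count in triple_cooccurrence(records).most_common(3):
        print(triple, count)
```

A pairwise count would merge the first two toy records into one author-keyword link. The triple keeps the journal dimension visible, which is the extra information the abstract attributes to three-entity analysis.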
This paper proposes the principle of comprehensive knowledge discovery. Unlike most current knowledge discovery methods, comprehensive knowledge discovery considers both the spatial relations and the attributes of spatial entities or objects. We introduce the theory of the spatial knowledge expression system and some concepts, including comprehensive knowledge discovery and the spatial union information table (SUIT). In theory, SUIT records all information contained in the studied objects, but in reality, because of the complexity and variety of spatial relations, only those factors of interest to us are selected. To discover comprehensive knowledge from spatial databases, an efficient comprehensive knowledge discovery algorithm called the recycled algorithm (RAR) is suggested.
From an ecological viewpoint, this paper discusses urban spatial-temporal relationships. We treat regional towns and cities as a complex man-land system of urban eco-communities. This complex man-land system comprises two elements, 'man' and 'land'. Here, 'man' means organization with self-determined consciousness, and 'land' means the physical environment (niche) that 'man' depends on. The complex man-land system has three basic components: individual, population, and community. Therefore, there are six types of spatial relationship for the complex man-land system: individual, population, community, man-man, land-land, and man-land spatial relationships. Taking the Pearl (Zhujiang) River Delta as a case study, the authors found evidence of urban spatial relationships in remote sensing data. First, the concentration and diffusion of cities' spatial relationships were found in the remote sensing imagery. Most of the cities concentrate in the core area of the Pearl River Delta, but diffusion is also significant. Second, growth and succession behavior of the urban spatial relationship was found by comparing remote sensing images from different dates. Third, inheritance, break, or meeting emergency behavior was observed in the remote sensing data. Fourth, the authors found many cases of symbiosis and competition in the remote sensing data of the Pearl River Delta. Fifth, autoeciousness, stranglehold, and invasion behavior of the urban spatial relationship was discovered in the remote sensing data.
LP (Logic Programming) has been successfully applied to knowledge discovery in many fields. The execution of LP is based on the evaluation of first-order predicates. Usually the information involved in the predicates is local and homogeneous, so the evaluation process is relatively simple. However, the evaluation process becomes much more complicated when applied to KDD on the Internet, where the information involved in the predicates may be heterogeneous and distributed over many different sites. Therefore, we try to attack the problem in a multi-agent system framework so that the logic program can be written in a site-independent style and can deal easily with heterogeneously represented information.
There are both associations and differences between structured and unstructured data mining. How to unify them in a single theoretical framework that guides research on knowledge discovery and data mining has become an urgent problem. On the basis of an analysis of existing research results, the united model of knowledge discovery state space (UMKDSS) is presented, which brings structured data mining and complex-type data mining together. UMKDSS can provide theoretical guidance for complex-type data mining. An application example of UMKDSS is given at the end.
Important Dates: Submission due November 15, 2005; notification of acceptance December 30, 2005; camera-ready copy due January 10, 2006. Workshop Scope: Intelligence and Security Informatics (ISI) can be broadly defined as the study of the development and use of advanced information technologies and systems for national and international security-related applications. The First and Second Symposiums on ISI were held in Tucson, Arizona, in 2003 and 2004, respectively. In 2005, the IEEE International Conference on ISI was held in Atlanta, Georgia. These ISI conferences have brought together academic researchers, law enforcement and intelligence experts, and information technology consultants and practitioners to discuss their research and practice related to various ISI topics, including ISI data management, data and text mining for ISI applications, terrorism informatics, deception detection, terrorist and criminal social network analysis, crime analysis, monitoring and surveillance, policy studies and evaluation, and information assurance, among others. We continue this stream of ISI conferences by organizing the Workshop on Intelligence and Security Informatics (WISI’06) in conjunction with the Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD’06). WISI’06 will provide a stimulating forum for ISI researchers in Pacific Asia and other regions of the world to exchange ideas and report research progress. The workshop also welcomes contributions dealing with ISI challenges specific to the Pacific Asian region.
Since the early 1990s, significant progress in database technology has provided a new platform for emerging dimensions of data engineering. New models were introduced to utilize the data sets stored in the new generations of databases. These models have had a deep impact on evolving decision-support systems, but they suffer from a variety of practical problems when accessing real-world data sources. Specifically, a type of data storage model based on data distribution theory has been increasingly used in recent years by large-scale enterprises, yet it is not compatible with existing decision-support models. This data storage model stores the data at the geographical sites where they are most regularly accessed. This leads to considerably less inter-site data transfer, which can reduce data security issues in some circumstances and also significantly improve the speed of data manipulation transactions. The aim of this paper is to propose a new approach for supporting proactive decision-making that utilizes a workable data source management methodology. The new model can effectively organize and use complex data sources, even when they are distributed across different sites in a fragmented form. At the same time, the new model provides a very high level of intelligent management decision support through intelligent use of the data collections, utilizing new smart methods for synthesizing useful knowledge. The results of an empirical study evaluating the model are provided.
A novel DNA-coding-based knowledge discovery algorithm is proposed, and an example verifying its validity is given. It is shown that this algorithm can efficiently discover new simplified rules from the original rule set.
Purpose: The study was carried out to construct a domain knowledge service system based on the Scientific & Technological Knowledge Organization Systems (STKOS). Design/methodology/approach: The framework of a domain knowledge service system is designed on the basis of the STKOS, and the STKOS science and technology vocabularies, category systems, and ontology networks are applied to realize the knowledge organization and semantic linking of scientific and technological information resources. Meanwhile, related knowledge-mining analysis algorithms and models are improved, and tools such as Solr and D3 are used for developing the system. The system integrates various knowledge service modules, including unified search of domain information resources and knowledge-linked navigation, domain hotspot and burst topic monitoring analysis, knowledge structure and evolution analysis, literature citation networks, and analysis of research agents' cooperative relationship networks. Findings: The system can help to refine the description, knowledge organization, and semantic linking of various kinds of information resources closely related to science and technology. Such resources include domain literature, institutions, scientists, projects, and more. Research limitations: Trial assessment and performance improvement should be carried out for the knowledge service application on the basis of more types and larger quantities of domain information resources. Practical implications: The domain knowledge service system provides an integrated knowledge discovery tool, as well as several kinds of knowledge mining analysis services for researchers. Originality/value: Our practice can be used as a valuable guide for libraries and information institutions that plan to provide deep domain knowledge services.
Based on an analysis of existing methods for ranking terms or documents by subject relevancy through an intermediary collection acting as a catalyst (designated the Group B collection) for non-interactive literature-based discovery, this article proposes a bi-directional ranking method based on document occurrence frequency, following the 'concurrence theory' and the degree and extent of subject relevancy. The method explores and further refines ranking based on the occurrence frequency of certain terms and documents, and it adds a new perspective: the concurrence of appropriate terms/documents in the 'low occurrence frequency component' of three non-interactive document collections. A preliminary experiment was conducted to analyze and test the significance and viability of the newly designed operational method.
Purpose: The late Don R. Swanson was well appreciated during his lifetime as Dean of the Graduate Library School at University of Chicago, as winner of the American Society for Information Science Award of Merit for 2000, and as author of many seminal articles. In this informal essay, I will give my personal perspective on Don's contributions to science, and outline some current and future directions in literature-based discovery that are rooted in concepts that he developed. Design/methodology/approach: Personal recollections and literature review. Findings: The Swanson A-B-C model of literature-based discovery has been successfully used by laboratory investigators analyzing their findings and hypotheses. It continues to be a fertile area of research in a wide range of application areas including text mining, drug repurposing, studies of scientific innovation, knowledge discovery in databases, and bioinformatics. Recently, additional modes of discovery that do not follow the A-B-C model have also been proposed and explored (e.g. so-called storytelling, gaps, analogies, link prediction, negative consensus, outliers, and revival of neglected or discarded research questions). Research limitations: This paper reflects the opinions of the author and is not a comprehensive or technically based review of literature-based discovery. Practical implications: The general scientific public is still not aware of the availability of tools for literature-based discovery. Our Arrowsmith project site maintains a suite of discovery tools that are free and open to the public (http://arrowsmith.psych.uic.edu), as does BITOLA, which is maintained by Dmitar Hristovski (http://ibmi.mf.uni-lj.si/bitola), and Epiphanet, which is maintained by Trevor Cohen (http://epiphanet.uth.tme.edu/). Bringing user-friendly tools to the public should be a high priority, since even more than advancing basic research in informatics, it is vital that we ensure that scientists actually use discovery tools and that these are actually able to help them make experimental discoveries in the lab and in the clinic. Originality/value: This paper discusses problems and issues which were inherent in Don's thoughts during his life, including those which have not yet been fully taken up and studied systematically.
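The A-B-C model named in the abstract above can be sketched in a few lines: starting from a term A, collect intermediate terms B that co-occur with A, then propose terms C that co-occur with some B but never directly with A. This is a toy open-discovery sketch, not the Arrowsmith implementation. The function name `abc_candidates` and the five-document corpus are illustrative, with the term sets modeled loosely on Swanson's well-known fish oil and Raynaud's disease example.

```python
from collections import defaultdict


def abc_candidates(documents, a_term):
    """Return {C term: set of linking B terms} for a starting term A.

    Assumption: each document is reduced to a set of index terms, and two
    terms are linked when they appear in the same document.
    """
    cooccur = defaultdict(set)
    for terms in documents:
        terms = set(terms)
        for term in terms:
            cooccur[term].update(terms - {term})

    b_terms = set(cooccur[a_term])  # terms directly linked to A
    candidates = defaultdict(set)
    for b in b_terms:
        for c in cooccur[b]:
            # Keep only terms with no direct link to A
            if c != a_term and c not in b_terms:
                candidates[c].add(b)
    return dict(candidates)


if __name__ == "__main__":
    # Toy corpus: each set stands for the index terms of one article
    corpus = [
        {"fish oil", "blood viscosity"},
        {"fish oil", "platelet aggregation"},
        {"blood viscosity", "raynaud's disease"},
        {"platelet aggregation", "raynaud's disease"},
        {"fish oil", "triglycerides"},
    ]
    for c_term, links in abc_candidates(corpus, "fish oil").items():
        print(c_term, "via", sorted(links))
```

Candidates supported by several independent B terms are natural to rank first. Real tools additionally filter B and C terms by semantic type and frequency, which this sketch omits.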
Funding: This work was financially supported by the National Natural Science Foundation of China (No. 69835001).
Abstract: A new algorithm for knowledge discovery based on statistic induction logic is proposed, and the validity of the method is verified by examples. The method is suitable for a large range of knowledge discovery applications in the study of causal relations, uncertainty knowledge acquisition, and principal factor analysis. The language field description of the state space makes the algorithm robust in adaptation and yields more easily understandable results, which are isomotopy with natural language in the topologic space.
Abstract: The 1st International Conference on Data-driven Knowledge Discovery: When Data Science Meets Information Science took place at the National Science Library (NSL), Chinese Academy of Sciences (CAS), in Beijing from June 19 to June 22, 2016. The Conference was opened by NSL Director Xiangyang Huang, who placed the event within the goals of the Library and lauded the spirit of international collaboration in the area of data science and knowledge discovery. The whole event was an encouraging success, with over 370 registered participants and highly enlightening presentations. The Conference was organized by the Journal of Data and Information Science (JDIS) to bring the Journal to the attention of an international and local audience.
Abstract: The present article outlines progress made in designing an intelligent information system for automatic management and knowledge discovery in large numeric and scientific databases, with a validating application to the CAST-NEONS environmental databases used for ocean modeling and prediction. We describe a discovery-learning process (Automatic Data Analysis System) that combines the features of two machine learning techniques to generate sets of production rules that efficiently describe the observational raw data contained in the database. Data clustering allows the system to classify the raw data into meaningful conceptual clusters, which the system learns by induction to build decision trees, from which the production rules are automatically deduced.
Abstract: In the current biomedical data movement, numerous efforts have been made to convert and normalize a large number of traditional structured and unstructured data (e.g., EHRs, reports) to semi-structured data (e.g., RDF, OWL). With the increasing amount of semi-structured data coming into the biomedical community, data integration and knowledge discovery from heterogeneous domains have become an important research problem. At the application level, detecting related concepts among medical ontologies is an important goal of life science research. It is even more crucial to figure out how different concepts are related within a single ontology or across multiple ontologies by analysing predicates in different knowledge bases. However, in today's information explosion, it is extremely difficult for biomedical researchers to find existing or potential predicates for linking cross-domain concepts without any support from schema pattern analysis. Therefore, there is a need for a mechanism that performs predicate-oriented pattern analysis to partition heterogeneous ontologies into smaller, closer topics and generates queries to discover cross-domain knowledge from each topic. In this paper, we present such a model, which performs predicate-oriented pattern analysis based on the close relationships among predicates and generates a similarity matrix. Based on this similarity matrix, we apply an innovative unsupervised learning algorithm to partition large data sets into smaller and closer topics and generate meaningful queries to fully discover knowledge over a set of interlinked data sources. We have implemented a prototype system named BmQGen and evaluated the proposed model with a colorectal surgical cohort from the Mayo Clinic.
Abstract: Purpose: This paper explores a method of knowledge discovery by visualizing and analyzing co-occurrence relations among three or more entities in collections of journal articles. Design/methodology/approach: A variety of methods such as model construction, system analysis, and experiments are used. The author has improved Morris' crossmapping technique and developed a technique for directly describing, visualizing, and analyzing co-occurrence relations among three or more entities in collections of journal articles. Findings: The visualization tools and the knowledge discovery method can efficiently reveal the multiple co-occurrence relations among three entities in collections of journal papers. They can reveal more, and more in-depth, information than analyzing co-occurrence relations between two entities. Therefore, this method can be used for mapping knowledge domains that are manifested in association with the entities from multi-dimensional perspectives and in an all-round way. Research limitations: The technique could only be used to analyze co-occurrence relations of less than three entities at present. Practical implications: This research has expanded the study scope of co-occurrence analysis. The research result has provided theoretical support for co-occurrence analysis. Originality/value: There has not been a systematic study on co-occurrence relations among multiple entities in collections of journal articles. This research defines multiple co-occurrence and the research scope, develops the visualization analysis tool, and designs the analysis model of the knowledge discovery method.
Funding: China’s National Surveying Technical Fund (No. 20007)
Abstract: This paper proposes the principle of comprehensive knowledge discovery. Unlike most current knowledge discovery methods, comprehensive knowledge discovery considers both the spatial relations and the attributes of spatial entities or objects. We introduce the theory of the spatial knowledge expression system and some concepts, including comprehensive knowledge discovery and the spatial union information table (SUIT). In theory, SUIT records all information contained in the studied objects, but in reality, because of the complexity and variety of spatial relations, only those factors of interest to us are selected. In order to find comprehensive knowledge in spatial databases, an efficient comprehensive knowledge discovery algorithm called the recycled algorithm (RAR) is suggested.
Funding: Under the auspices of the National Natural Science Foundation of China (No. 69896250-4).
Abstract: From the ecological viewpoint, this paper discusses the urban spatial-temporal relationship. We take regional towns and cities as a complex man-land system of urban eco-community. This complex man-land system comprises two elements, 'man' and 'land'. Here, 'man' means organization with self-determined consciousness, and 'land' means the physical environment (niche) that 'man' depends on. The complex man-land system has three basic components: individual, population, and community. Therefore there are six types of spatial relationship for the complex man-land system: individual, population, community, man-man, land-land, and man-land spatial relationships. Taking the Pearl (Zhujiang) River Delta as a case study, the authors found some evidence of the urban spatial relationship in remote sensing data. Firstly, the concentration and diffusion of the cities' spatial relationship was found in the remote sensing imagery. Most of the cities concentrate in the core area of the Pearl River Delta, but the diffusion situation is also significant. Secondly, the growth behavior and succession behavior of the urban spatial relationship were found by comparing remote sensing images acquired at different times. Thirdly, the inheritance, break, or meeting emergency behavior was observed in the remote sensing data. Fourthly, the authors found many cases of symbiosis and competition in the remote sensing data of the Pearl River Delta. Fifthly, the autoeciousness, stranglehold, and invasion behavior of the urban spatial relationship was discovered from the remote sensing data.
Abstract: LP (Logic Programming) has been successfully applied to knowledge discovery in many fields. The execution of LP is based on the evaluation of first-order predicates. Usually the information involved in the predicates is local and homogeneous, so the evaluation process is relatively simple. However, the evaluation process becomes much more complicated when applied to KDD on the Internet, where the information involved in the predicates may be heterogeneous and distributed over many different sites. Therefore, we try to attack the problem within a multi-agent system framework, so that the logic program can be written in a site-independent style and deal easily with heterogeneously represented information.
Abstract: There are both associations and differences between structured and unstructured data mining. How to unite them in a single theoretical framework that can guide research on knowledge discovery and data mining has become an urgent problem. On the basis of an analysis of existing research results, the united model of knowledge discovery state space (UMKDSS) is presented, which links structured data mining with complex type data mining. UMKDSS can provide theoretical guidance for complex type data mining. An application example of UMKDSS is given at the end.
Abstract: Important Dates: Submission due November 15, 2005; notification of acceptance December 30, 2005; camera-ready copy due January 10, 2006. Workshop Scope: Intelligence and Security Informatics (ISI) can be broadly defined as the study of the development and use of advanced information technologies and systems for national and international security-related applications. The First and Second Symposiums on ISI were held in Tucson, Arizona, in 2003 and 2004, respectively. In 2005, the IEEE International Conference on ISI was held in Atlanta, Georgia. These ISI conferences have brought together academic researchers, law enforcement and intelligence experts, and information technology consultants and practitioners to discuss their research and practice related to various ISI topics, including ISI data management, data and text mining for ISI applications, terrorism informatics, deception detection, terrorist and criminal social network analysis, crime analysis, monitoring and surveillance, policy studies and evaluation, and information assurance, among others. We continue this stream of ISI conferences by organizing the Workshop on Intelligence and Security Informatics (WISI’06) in conjunction with the Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD’06). WISI’06 will provide a stimulating forum for ISI researchers in Pacific Asia and other regions of the world to exchange ideas and report research progress. The workshop also welcomes contributions dealing with ISI challenges specific to the Pacific Asian region.
Abstract: Since the early 1990s, significant progress in database technology has provided a new platform for emerging dimensions of data engineering. New models were introduced to utilize the data sets stored in the new generations of databases. These models have a deep impact on evolving decision-support systems, but they suffer from a variety of practical problems when accessing real-world data sources. Specifically, a type of data storage model based on data distribution theory has been increasingly used in recent years by large-scale enterprises, yet it is not compatible with existing decision-support models. This data storage model stores the data at the geographical sites where they are most regularly accessed. This leads to considerably less inter-site data transfer, which can reduce data security issues in some circumstances and also significantly improve the speed of data manipulation transactions. The aim of this paper is to propose a new approach for supporting proactive decision-making that utilizes a workable data source management methodology. The new model can effectively organize and use complex data sources, even when they are distributed across different sites in a fragmented form. At the same time, the new model provides a very high level of intellectual management decision-support through intelligent use of the data collections, utilizing new smart methods for synthesizing useful knowledge. The results of an empirical study to evaluate the model are provided.
Abstract: A novel knowledge discovery algorithm based on DNA coding is proposed, and an example verifying its validity is given. It is proved that this algorithm can efficiently discover new simplified rules from the original rule set.
Funding: Supported by the Ministry of Science and Technology of China (Project No.: 2011BAH10B06)
Abstract: Purpose: The study was carried out to construct a domain knowledge service system based on the Scientific & Technological Knowledge Organization Systems (STKOS). Design/methodology/approach: The framework of a domain knowledge service system is designed on the basis of the STKOS, and the STKOS science and technology vocabularies, category systems, and ontology networks are applied to realize the knowledge organization and semantic linking of the scientific and technological information resources. Meanwhile, related knowledge-mining analysis algorithms and models are improved, and some tools such as Solr and D3 are used for developing the system. This system integrates various knowledge service modules, including unified search of domain information resources and knowledge-linked navigation, domain hotspot and burst topics monitoring analysis, knowledge structure and evolution analysis, literature citation network, and research agents’ cooperative relationship network analysis. Findings: The system can help to refine descriptions, knowledge organization, and the semantic linking of various kinds of information resources closely related to science and technology. Such resources include domain literature, institutions, scientists, projects, and more. Research limitations: Trial assessment and performance improvement should be carried out for the knowledge service application on the basis of more types of and larger quantities of domain information resources. Practical implications: The domain knowledge service system provides an integrated knowledge discovery tool, as well as several kinds of knowledge mining analysis services for researchers. Originality/value: Our practice can be used as a valuable guide for libraries and information institutions that plan to provide deep domain knowledge services.
Funding: Supported by the Humanities and Social Science Foundation of the Ministry of Education of China (Grant No. 07JA870005)
Abstract: Based on an analysis of existing methods for ranking terminology or the subject relevancy of documents through an intermediary collection acting as a catalyst (designated as the Group B collection) for the purpose of non-interactive literature-based discovery, this article proposes a bi-directional ranking method based on document occurrence frequency, according to the 'concurrence theory' and the degree and extent of subject relevancy. This method explores and further refines the ranking method that is based on the occurrence frequency of certain terminologies and documents, and it adds a new perspective: the concurrence of appropriate terminologies/documents in the 'low occurrence frequency component' of three non-interactive document collections. A preliminary experiment was conducted to analyze and test the significance and viability of the newly designed operational method.
Funding: Supported by NIH grants R01LM010817 and P01AG039347
Abstract: Purpose: The late Don R. Swanson was well appreciated during his lifetime as Dean of the Graduate Library School at University of Chicago, as winner of the American Society for Information Science Award of Merit for 2000, and as author of many seminal articles. In this informal essay, I will give my personal perspective on Don's contributions to science, and outline some current and future directions in literature-based discovery that are rooted in concepts that he developed. Design/methodology/approach: Personal recollections and literature review. Findings: The Swanson A-B-C model of literature-based discovery has been successfully used by laboratory investigators analyzing their findings and hypotheses. It continues to be a fertile area of research in a wide range of application areas including text mining, drug repurposing, studies of scientific innovation, knowledge discovery in databases, and bioinformatics. Recently, additional modes of discovery that do not follow the A-B-C model have also been proposed and explored (e.g. so-called storytelling, gaps, analogies, link prediction, negative consensus, outliers, and revival of neglected or discarded research questions). Research limitations: This paper reflects the opinions of the author and is neither a comprehensive nor a technically based review of literature-based discovery. Practical implications: The general scientific public is still not aware of the availability of tools for literature-based discovery. Our Arrowsmith project site maintains a suite of discovery tools that are free and open to the public (http://arrowsmith.psych.uic.edu), as does BITOLA, which is maintained by Dmitar Hristovski (http://ibmi.mf.uni-lj.si/bitola), and Epiphanet, which is maintained by Trevor Cohen (http://epiphanet.uth.tme.edu/). Bringing user-friendly tools to the public should be a high priority, since even more than advancing basic research in informatics, it is vital that we ensure that scientists actually use discovery tools and that these are actually able to help them make experimental discoveries in the lab and in the clinic. Originality/value: This paper discusses problems and issues which were inherent in Don's thoughts during his life, including those which have not yet been fully taken up and studied systematically.
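The A-B-C model mentioned in the abstract above can be illustrated with a small sketch: a starting term A co-occurs with intermediate terms B in some documents, the B terms co-occur with other terms C elsewhere, and C terms that never co-occur directly with A become candidate hypotheses. The code below is a simplified, assumed formulation and not the Arrowsmith, BITOLA, or Epiphanet implementation; the function name `abc_candidates` and the toy documents (loosely modeled on the well-known fish oil and Raynaud's syndrome example) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def abc_candidates(docs: List[Set[str]], a_term: str) -> List[Tuple[str, Set[str]]]:
    """Rank C terms linked to a_term only through shared B terms.

    Each document is represented as a set of terms. Candidates are
    sorted by the number of distinct linking B terms, then by name.
    """
    # Build a symmetric term co-occurrence map.
    cooc: Dict[str, Set[str]] = defaultdict(set)
    for terms in docs:
        for term in terms:
            cooc[term] |= terms - {term}

    b_terms = cooc[a_term]  # terms seen directly with A
    links: Dict[str, Set[str]] = defaultdict(set)
    for b in b_terms:
        for c in cooc[b]:
            # Keep C only if it has no direct co-occurrence with A.
            if c != a_term and c not in b_terms:
                links[c].add(b)
    return sorted(links.items(), key=lambda item: (-len(item[1]), item[0]))


# Toy corpus: each set stands for the indexed terms of one article.
toy_docs = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynaud syndrome"},
    {"platelet aggregation", "raynaud syndrome"},
    {"platelet aggregation", "aspirin"},
]

for c_term, via in abc_candidates(toy_docs, "fish oil"):
    print(c_term, "via", sorted(via))
```

Ranking by the count of linking B terms is only one simple heuristic. Real systems typically add term filtering and statistical weighting, which this sketch omits.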