The 1st International Conference on Data-driven Knowledge Discovery: When Data Science Meets Information Science took place at the National Science Library (NSL), Chinese Academy of Sciences (CAS) in Beijing from...The 1st International Conference on Data-driven Knowledge Discovery: When Data Science Meets Information Science took place at the National Science Library (NSL), Chinese Academy of Sciences (CAS) in Beijing from June 19 till June 22, 2016. The Conference was opened by NSL Director Xiangyang Huang, who placed the event within the goals of the Library, and lauded the spirit of intemational collaboration in the area of data science and knowledge discovery. The whole event was an encouraging success with over 370 registered participants and highly enlightening presentations. The Conference was organized by the Journal of Data andlnformation Science (JDIS) to bring the Joumal to the attention of an international and local audience.展开更多
Important Dates Submission due November 15, 2005 Notification of acceptance December 30, 2005 Camera-ready copy due January 10, 2006 Workshop Scope Intelligence and Security Informatics (ISI) can be broadly defined as...Important Dates Submission due November 15, 2005 Notification of acceptance December 30, 2005 Camera-ready copy due January 10, 2006 Workshop Scope Intelligence and Security Informatics (ISI) can be broadly defined as the study of the development and use of advanced information technologies and systems for national and international security-related applications. The First and Second Symposiums on ISI were held in Tucson,Arizona,in 2003 and 2004,respectively. In 2005,the IEEE International Conference on ISI was held in Atlanta,Georgia. These ISI conferences have brought together academic researchers,law enforcement and intelligence experts,information technology consultant and practitioners to discuss their research and practice related to various ISI topics including ISI data management,data and text mining for ISI applications,terrorism informatics,deception detection,terrorist and criminal social network analysis,crime analysis,monitoring and surveillance,policy studies and evaluation,information assurance,among others. We continue this stream of ISI conferences by organizing the Workshop on Intelligence and Security Informatics (WISI’06) in conjunction with the Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD’06). WISI’06 will provide a stimulating forum for ISI researchers in Pacific Asia and other regions of the world to exchange ideas and report research progress. The workshop also welcomes contributions dealing with ISI challenges specific to the Pacific Asian region.展开更多
Since the early 1990, significant progress in database technology has provided new platform for emerging new dimensions of data engineering. New models were introduced to utilize the data sets stored in the new genera...Since the early 1990, significant progress in database technology has provided new platform for emerging new dimensions of data engineering. New models were introduced to utilize the data sets stored in the new generations of databases. These models have a deep impact on evolving decision-support systems. But they suffer a variety of practical problems while accessing real-world data sources. Specifically a type of data storage model based on data distribution theory has been increasingly used in recent years by large-scale enterprises, while it is not compatible with existing decision-support models. This data storage model stores the data in different geographical sites where they are more regularly accessed. This leads to considerably less inter-site data transfer that can reduce data security issues in some circumstances and also significantly improve data manipulation transactions speed. The aim of this paper is to propose a new approach for supporting proactive decision-making that utilizes a workable data source management methodology. The new model can effectively organize and use complex data sources, even when they are distributed in different sites in a fragmented form. At the same time, the new model provides a very high level of intellectual management decision-support by intelligent use of the data collections through utilizing new smart methods in synthesizing useful knowledge. The results of an empirical study to evaluate the model are provided.展开更多
This paper proposes the principle of comprehensive knowledge discovery.Unlike most of the current knowledge discovery methods,the comprehensive knowledge discovery considers both the spatial relations and attributes o...This paper proposes the principle of comprehensive knowledge discovery.Unlike most of the current knowledge discovery methods,the comprehensive knowledge discovery considers both the spatial relations and attributes of spatial entities or objects.We introduce the theory of spatial knowledge expression system and some concepts including comprehensive knowledge discovery and spatial union information table(SUIT).In theory,SUIT records all information contained in the studied objects,but in reality,because of the complexity and varieties of spatial relations,only those factors of interest to us are selected.In order to find out the comprehensive knowledge from spatial databases,an efficient comprehensive knowledge discovery algorithm called recycled algorithm(RAR)is suggested.展开更多
It is common in industrial construction projects for data to be collected and discarded without being analyzed to extract useful knowledge. A proposed integrated methodology based on a five-step Knowledge Discovery in...It is common in industrial construction projects for data to be collected and discarded without being analyzed to extract useful knowledge. A proposed integrated methodology based on a five-step Knowledge Discovery in Data (KDD) model was developed to address this issue. The framework transfers existing multidimensional historical data from completed projects into useful knowledge for future projects. The model starts by understanding the problem domain, industrial construction projects. The second step is analyzing the problem data and its multiple dimensions. The target dataset is the labour resources data generated while managing industrial construction projects. The next step is developing the data collection model and prototype data ware-house. The data warehouse stores collected data in a ready-for-mining format and produces dynamic On Line Analytical Processing (OLAP) reports and graphs. Data was collected from a large western-Canadian structural steel fabricator to prove the applicability of the developed methodology. The proposed framework was applied to three different case studies to validate the applicability of the developed framework to real projects data.展开更多
A large number of ontologies have been introduced by the biomedical community in recent years. Knowledge discovery for entity identification from ontology has become an important research area, and it is always intere...A large number of ontologies have been introduced by the biomedical community in recent years. Knowledge discovery for entity identification from ontology has become an important research area, and it is always interesting to discovery how associations are established to connect concepts in a single ontology or across multiple ontologies. However, due to the exponential growth of biomedical big data and their complicated associations, it becomes very challenging to detect key associations among entities in an inefficient dynamic manner. Therefore, there exists a gap between the increasing needs for association detection and large volume of biomedical ontologies. In this paper, to bridge this gap, we presented a knowledge discovery framework, the BioBroker, for grouping entities to facilitate the process of biomedical knowledge discovery in an intelligent way. Specifically, we developed an innovative knowledge discovery algorithm that combines a graph clustering method and an indexing technique to discovery knowledge patterns over a set of interlinked data sources in an efficient way. We have demonstrated capabilities of the BioBroker for query execution with a use case study on a subset of the Bio2RDF life science linked data.展开更多
There are both associations and differences between structured and unstructured data mining. How to unite them together to be a united theoretical framework and to guide the research of knowledge discovery and data mi...There are both associations and differences between structured and unstructured data mining. How to unite them together to be a united theoretical framework and to guide the research of knowledge discovery and data mining has become an urgent problem to be solved. On the base of analysis and study of existing research results, the united model of knowledge discovery state space (UMKDSS) is presented, and the structured data mining and the complex type data mining are associated together. UMKDSS can provide theoretical guidance for complex type data mining. An application example of UMKDSS is given at last.展开更多
Recent developments in database technology have seen a wide variety of data being stored in huge collections. The wide variety makes the analysis tasks of a generic database a strenuous task in knowledge discovery. On...Recent developments in database technology have seen a wide variety of data being stored in huge collections. The wide variety makes the analysis tasks of a generic database a strenuous task in knowledge discovery. One approach is to summarize large datasets in such a way that the resulting summary dataset is of manageable size. Histogram has received significant attention as summarization/representative object for large database. But, it suffers from computational and space complexity. In this paper, we propose an idea to transform the histogram object into a Piecewise Linear Regression (PLR) line object and suggest that PLR objects can be less computational and storage intensive while compared to those of histograms. On the other hand to carry out a cluster analysis, we propose a distance measure for computing the distance between the PLR lines. Case study is presented based on the real data of online education system LMS. This demonstrates that PLR is a powerful knowledge representative for very large database.展开更多
Knowledge discovery, as an increasingly adopted information technology in biomedical science, has shown great promise in the field of Traditional Chinese Medicine (TCM). In this paper, we provided a kind of multidimen...Knowledge discovery, as an increasingly adopted information technology in biomedical science, has shown great promise in the field of Traditional Chinese Medicine (TCM). In this paper, we provided a kind of multidimensional table which was well suited for organizing and analyzing the data in ancient Chinese books on Materia Medica. Moreover, we demonstrated its capability of facilitating further mining works in TCM through two illustrative studies of discovering meaningful patterns in the three-dimensional table of Shennong’s Classic of Materia Medica. This work might provide an appropriate data model for the development of knowledge discovery in TCM.展开更多
Structural choice is a significant decision having an important influence on structural function, social economics, structural reliability and construction cost. A Case Based Reasoning system with its retrieval part c...Structural choice is a significant decision having an important influence on structural function, social economics, structural reliability and construction cost. A Case Based Reasoning system with its retrieval part constructed with a KDD subsystem, is put forward to make a decision for a large scale engineering project. A typical CBR system consists of four parts: case representation, case retriever, evaluation, and adaptation. A case library is a set of parameterized excellent and successful structures. For a structural choice, the key point is that the system must be able to detect the pattern classes hidden in the case library and classify the input parameters into classes properly. That is done by using the KDD Data Mining algorithm based on Self Organizing Feature Maps (SOFM), which makes the whole system more adaptive, self organizing, self learning and open.展开更多
With massive amounts of data stored in databases, mining information and knowledge in databases has become an important issue in recent research. Researchers in many different fields have shown great interest in data ...With massive amounts of data stored in databases, mining information and knowledge in databases has become an important issue in recent research. Researchers in many different fields have shown great interest in data mining and knowledge discovery in databases. Several emerging applications in information providing services, such as data warehousing and on-line services over the Internet, also call for various data mining and knowledge discovery techniques to understand user behavior better, to improve the service provided, and to increase the business opportunities. In response to such a demand, this article is to provide a comprehensive survey on the data mining and knowledge discovery techniques developed recently, and introduce some real application systems as well. In conclusion, this article also lists some problems and challenges for further research.展开更多
Tsinghua Science and Technology is founded and published since 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. This journal aims at presenting the up-to-date ...Tsinghua Science and Technology is founded and published since 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. This journal aims at presenting the up-to-date scientific achievements in computer science, and other information technology fields. It is indexed by Ei and other abstracting and indexing services. From 2013, the journal commits to the open access at IEEE Xplore Digital Library.展开更多
It is important for telecom companies to make sense of the large number of data they have accumulated over the years. This paper reviews the concepts and the techniques of knowledge discovery in databases (KDD), and s...It is important for telecom companies to make sense of the large number of data they have accumulated over the years. This paper reviews the concepts and the techniques of knowledge discovery in databases (KDD), and surveys applications of this technology in the telecommunications sector all over the world. It also discusses some possible applications of this technology in China, and reports a preliminary result of the first attempt to apply KDD technique in telephone traffic volume prediction. It concludes that KDD is a promising technology that can help to enhance-the competitiveness of China's telecom companies in the face of looming competition in a liberated market.展开更多
An integrated solution for discovery of literature information knowledge is proposed. The analytic model of literature Information model and discovery of literature information knowledge are illustrated. Practical ill...An integrated solution for discovery of literature information knowledge is proposed. The analytic model of literature Information model and discovery of literature information knowledge are illustrated. Practical illustrative example for discovery of literature information knowledge is given.展开更多
[目的/意义]科技文献蕴含丰富的领域知识与科学数据,可为人工智能驱动的科学研究(AI for Science,AI4S)提供高质量数据支撑。本文系统梳理大语言模型(Large Language Models,LLMs)在科技文献数据挖掘中的方法技术、软件工具及应用场景,...[目的/意义]科技文献蕴含丰富的领域知识与科学数据,可为人工智能驱动的科学研究(AI for Science,AI4S)提供高质量数据支撑。本文系统梳理大语言模型(Large Language Models,LLMs)在科技文献数据挖掘中的方法技术、软件工具及应用场景,探讨其研究方向与发展趋势。[方法/过程]本文基于文献调研与归纳总结,在方法技术层面,从文本知识、科学数据与图表信息分析了LLMs驱动的科技文献细粒度数据挖掘关键技术以及综合性知识生成的方法;在软件工具层面,归纳了主流LLMs科技文献数据挖掘与知识生成工具的方法技术、核心功能和适用场景;在应用场景层面,分析了科技文献数据挖掘应用于LLMs的实践价值。[结果/结论]在方法技术方面,通过动态提示学习框架与领域适配微调等技术,LLMs极大提升科技文献数据挖掘精度与效度;在软件工具方面,已初步形成从数据标注、数据挖掘、合成数据到知识生成的全流程LLMs科技文献数据挖掘工具链;在应用方面,科技文献数据可为LLMs提供专业化语料和高质量数据,LLMs推动科技文献从单维数据服务向多模态知识生成服务的范式演进。然而,当前仍面临领域知识表征深度不足、跨模态推理效率较低、知识生成可解释性欠缺等挑战。未来应着重研发具有可解释性与跨领域适应性的LLMs科技文献数据挖掘工具,集成“人在回路”的协同机制,促进科技文献数据挖掘从效率优化向知识创造转变。展开更多
Knowledge Discovery in Databases is gaining attention and raising new hopes for traditional Chinese medicine (TCM) researchers. It is a useful tool in understanding and deciphering TCM theories. Aiming for a better ...Knowledge Discovery in Databases is gaining attention and raising new hopes for traditional Chinese medicine (TCM) researchers. It is a useful tool in understanding and deciphering TCM theories. Aiming for a better understanding of Chinese herbal property theory (CHPT), this paper performed an improved association rule learning to analyze semistructured text in the book entitled Shennong's Classic of Materia Medica. The text was firstly annotated and transformed to well-structured multidimensional data. Subsequently, an Apriori algorithm was employed for producing association rules after the sensitivity analysis of parameters. From the confirmed 120 resulting rules that described the intrinsic relationships between herbal property (qi, flavor and their combinations) and herbal efficacy, two novel fundamental principles underlying CHPT were acquired and further elucidated: (1) the many-to-one mapping of herbal efficacy to herbal property; (2) the nonrandom overlap between the related efficacy of qi and flavor. This work provided an innovative knowledge about CHPT, which would be helpful for its modern research.展开更多
The objective,connotations and research issues of big geodata mining were discussed to address its significance to geographical research in this paper.Big geodata may be categorized into two domains:big earth observat...The objective,connotations and research issues of big geodata mining were discussed to address its significance to geographical research in this paper.Big geodata may be categorized into two domains:big earth observation data and big human behavior data.A description of big geodata includes,in addition to the“5Vs”(volume,velocity,value,variety and veracity),a further five features,that is,granularity,scope,density,skewness and precision.Based on this approach,the essence of mining big geodata includes four aspects.First,flow space,where flow replaces points in traditional space,will become the new presentation form for big human behavior data.Second,the objectives for mining big geodata are the spatial patterns and the spatial relationships.Third,the spatiotemporal distributions of big geodata can be viewed as overlays of multiple geographic patterns and the characteristics of the data,namely heterogeneity and homogeneity,may change with scale.Fourth,data mining can be seen as a tool for discovery of geographic patterns and the patterns revealed may be attributed to human-land relationships.The big geodata mining methods may be categorized into two types in view of the mining objective,i.e.,classification mining and relationship mining.Future research will be faced by a number of issues,including the aggregation and connection of big geodata,the effective evaluation of the mining results and the challenge for mining to reveal“non-trivial”knowledge.展开更多
文摘The 1st International Conference on Data-driven Knowledge Discovery: When Data Science Meets Information Science took place at the National Science Library (NSL), Chinese Academy of Sciences (CAS) in Beijing from June 19 till June 22, 2016. The Conference was opened by NSL Director Xiangyang Huang, who placed the event within the goals of the Library, and lauded the spirit of intemational collaboration in the area of data science and knowledge discovery. The whole event was an encouraging success with over 370 registered participants and highly enlightening presentations. The Conference was organized by the Journal of Data andlnformation Science (JDIS) to bring the Joumal to the attention of an international and local audience.
文摘Important Dates Submission due November 15, 2005 Notification of acceptance December 30, 2005 Camera-ready copy due January 10, 2006 Workshop Scope Intelligence and Security Informatics (ISI) can be broadly defined as the study of the development and use of advanced information technologies and systems for national and international security-related applications. The First and Second Symposiums on ISI were held in Tucson,Arizona,in 2003 and 2004,respectively. In 2005,the IEEE International Conference on ISI was held in Atlanta,Georgia. These ISI conferences have brought together academic researchers,law enforcement and intelligence experts,information technology consultant and practitioners to discuss their research and practice related to various ISI topics including ISI data management,data and text mining for ISI applications,terrorism informatics,deception detection,terrorist and criminal social network analysis,crime analysis,monitoring and surveillance,policy studies and evaluation,information assurance,among others. We continue this stream of ISI conferences by organizing the Workshop on Intelligence and Security Informatics (WISI’06) in conjunction with the Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD’06). WISI’06 will provide a stimulating forum for ISI researchers in Pacific Asia and other regions of the world to exchange ideas and report research progress. The workshop also welcomes contributions dealing with ISI challenges specific to the Pacific Asian region.
文摘Since the early 1990, significant progress in database technology has provided new platform for emerging new dimensions of data engineering. New models were introduced to utilize the data sets stored in the new generations of databases. These models have a deep impact on evolving decision-support systems. But they suffer a variety of practical problems while accessing real-world data sources. Specifically a type of data storage model based on data distribution theory has been increasingly used in recent years by large-scale enterprises, while it is not compatible with existing decision-support models. This data storage model stores the data in different geographical sites where they are more regularly accessed. This leads to considerably less inter-site data transfer that can reduce data security issues in some circumstances and also significantly improve data manipulation transactions speed. The aim of this paper is to propose a new approach for supporting proactive decision-making that utilizes a workable data source management methodology. The new model can effectively organize and use complex data sources, even when they are distributed in different sites in a fragmented form. At the same time, the new model provides a very high level of intellectual management decision-support by intelligent use of the data collections through utilizing new smart methods in synthesizing useful knowledge. The results of an empirical study to evaluate the model are provided.
基金the China’s National Surveying Technical Fund(No.20007)
文摘This paper proposes the principle of comprehensive knowledge discovery.Unlike most of the current knowledge discovery methods,the comprehensive knowledge discovery considers both the spatial relations and attributes of spatial entities or objects.We introduce the theory of spatial knowledge expression system and some concepts including comprehensive knowledge discovery and spatial union information table(SUIT).In theory,SUIT records all information contained in the studied objects,but in reality,because of the complexity and varieties of spatial relations,only those factors of interest to us are selected.In order to find out the comprehensive knowledge from spatial databases,an efficient comprehensive knowledge discovery algorithm called recycled algorithm(RAR)is suggested.
文摘It is common in industrial construction projects for data to be collected and discarded without being analyzed to extract useful knowledge. A proposed integrated methodology based on a five-step Knowledge Discovery in Data (KDD) model was developed to address this issue. The framework transfers existing multidimensional historical data from completed projects into useful knowledge for future projects. The model starts by understanding the problem domain, industrial construction projects. The second step is analyzing the problem data and its multiple dimensions. The target dataset is the labour resources data generated while managing industrial construction projects. The next step is developing the data collection model and prototype data ware-house. The data warehouse stores collected data in a ready-for-mining format and produces dynamic On Line Analytical Processing (OLAP) reports and graphs. Data was collected from a large western-Canadian structural steel fabricator to prove the applicability of the developed methodology. The proposed framework was applied to three different case studies to validate the applicability of the developed framework to real projects data.
文摘A large number of ontologies have been introduced by the biomedical community in recent years. Knowledge discovery for entity identification from ontology has become an important research area, and it is always interesting to discovery how associations are established to connect concepts in a single ontology or across multiple ontologies. However, due to the exponential growth of biomedical big data and their complicated associations, it becomes very challenging to detect key associations among entities in an inefficient dynamic manner. Therefore, there exists a gap between the increasing needs for association detection and large volume of biomedical ontologies. In this paper, to bridge this gap, we presented a knowledge discovery framework, the BioBroker, for grouping entities to facilitate the process of biomedical knowledge discovery in an intelligent way. Specifically, we developed an innovative knowledge discovery algorithm that combines a graph clustering method and an indexing technique to discovery knowledge patterns over a set of interlinked data sources in an efficient way. We have demonstrated capabilities of the BioBroker for query execution with a use case study on a subset of the Bio2RDF life science linked data.
文摘There are both associations and differences between structured and unstructured data mining. How to unite them together to be a united theoretical framework and to guide the research of knowledge discovery and data mining has become an urgent problem to be solved. On the base of analysis and study of existing research results, the united model of knowledge discovery state space (UMKDSS) is presented, and the structured data mining and the complex type data mining are associated together. UMKDSS can provide theoretical guidance for complex type data mining. An application example of UMKDSS is given at last.
文摘Recent developments in database technology have seen a wide variety of data being stored in huge collections. The wide variety makes the analysis tasks of a generic database a strenuous task in knowledge discovery. One approach is to summarize large datasets in such a way that the resulting summary dataset is of manageable size. Histogram has received significant attention as summarization/representative object for large database. But, it suffers from computational and space complexity. In this paper, we propose an idea to transform the histogram object into a Piecewise Linear Regression (PLR) line object and suggest that PLR objects can be less computational and storage intensive while compared to those of histograms. On the other hand to carry out a cluster analysis, we propose a distance measure for computing the distance between the PLR lines. Case study is presented based on the real data of online education system LMS. This demonstrates that PLR is a powerful knowledge representative for very large database.
文摘Knowledge discovery, as an increasingly adopted information technology in biomedical science, has shown great promise in the field of Traditional Chinese Medicine (TCM). In this paper, we provided a kind of multidimensional table which was well suited for organizing and analyzing the data in ancient Chinese books on Materia Medica. Moreover, we demonstrated its capability of facilitating further mining works in TCM through two illustrative studies of discovering meaningful patterns in the three-dimensional table of Shennong’s Classic of Materia Medica. This work might provide an appropriate data model for the development of knowledge discovery in TCM.
文摘Structural choice is a significant decision having an important influence on structural function, social economics, structural reliability and construction cost. A Case Based Reasoning system with its retrieval part constructed with a KDD subsystem, is put forward to make a decision for a large scale engineering project. A typical CBR system consists of four parts: case representation, case retriever, evaluation, and adaptation. A case library is a set of parameterized excellent and successful structures. For a structural choice, the key point is that the system must be able to detect the pattern classes hidden in the case library and classify the input parameters into classes properly. That is done by using the KDD Data Mining algorithm based on Self Organizing Feature Maps (SOFM), which makes the whole system more adaptive, self organizing, self learning and open.
文摘With massive amounts of data stored in databases, mining information and knowledge in databases has become an important issue in recent research. Researchers in many different fields have shown great interest in data mining and knowledge discovery in databases. Several emerging applications in information providing services, such as data warehousing and on-line services over the Internet, also call for various data mining and knowledge discovery techniques to understand user behavior better, to improve the service provided, and to increase the business opportunities. In response to such a demand, this article is to provide a comprehensive survey on the data mining and knowledge discovery techniques developed recently, and introduce some real application systems as well. In conclusion, this article also lists some problems and challenges for further research.
文摘Tsinghua Science and Technology is founded and published since 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. This journal aims at presenting the up-to-date scientific achievements in computer science, and other information technology fields. It is indexed by Ei and other abstracting and indexing services. From 2013, the journal commits to the open access at IEEE Xplore Digital Library.
文摘It is important for telecom companies to make sense of the large number of data they have accumulated over the years. This paper reviews the concepts and the techniques of knowledge discovery in databases (KDD), and surveys applications of this technology in the telecommunications sector all over the world. It also discusses some possible applications of this technology in China, and reports a preliminary result of the first attempt to apply KDD technique in telephone traffic volume prediction. It concludes that KDD is a promising technology that can help to enhance-the competitiveness of China's telecom companies in the face of looming competition in a liberated market.
文摘An integrated solution for discovery of literature information knowledge is proposed. The analytic model of literature Information model and discovery of literature information knowledge are illustrated. Practical illustrative example for discovery of literature information knowledge is given.
文摘[目的/意义]科技文献蕴含丰富的领域知识与科学数据,可为人工智能驱动的科学研究(AI for Science,AI4S)提供高质量数据支撑。本文系统梳理大语言模型(Large Language Models,LLMs)在科技文献数据挖掘中的方法技术、软件工具及应用场景,探讨其研究方向与发展趋势。[方法/过程]本文基于文献调研与归纳总结,在方法技术层面,从文本知识、科学数据与图表信息分析了LLMs驱动的科技文献细粒度数据挖掘关键技术以及综合性知识生成的方法;在软件工具层面,归纳了主流LLMs科技文献数据挖掘与知识生成工具的方法技术、核心功能和适用场景;在应用场景层面,分析了科技文献数据挖掘应用于LLMs的实践价值。[结果/结论]在方法技术方面,通过动态提示学习框架与领域适配微调等技术,LLMs极大提升科技文献数据挖掘精度与效度;在软件工具方面,已初步形成从数据标注、数据挖掘、合成数据到知识生成的全流程LLMs科技文献数据挖掘工具链;在应用方面,科技文献数据可为LLMs提供专业化语料和高质量数据,LLMs推动科技文献从单维数据服务向多模态知识生成服务的范式演进。然而,当前仍面临领域知识表征深度不足、跨模态推理效率较低、知识生成可解释性欠缺等挑战。未来应着重研发具有可解释性与跨领域适应性的LLMs科技文献数据挖掘工具,集成“人在回路”的协同机制,促进科技文献数据挖掘从效率优化向知识创造转变。
文摘Knowledge Discovery in Databases is gaining attention and raising new hopes for traditional Chinese medicine (TCM) researchers. It is a useful tool in understanding and deciphering TCM theories. Aiming for a better understanding of Chinese herbal property theory (CHPT), this paper performed an improved association rule learning to analyze semistructured text in the book entitled Shennong's Classic of Materia Medica. The text was firstly annotated and transformed to well-structured multidimensional data. Subsequently, an Apriori algorithm was employed for producing association rules after the sensitivity analysis of parameters. From the confirmed 120 resulting rules that described the intrinsic relationships between herbal property (qi, flavor and their combinations) and herbal efficacy, two novel fundamental principles underlying CHPT were acquired and further elucidated: (1) the many-to-one mapping of herbal efficacy to herbal property; (2) the nonrandom overlap between the related efficacy of qi and flavor. This work provided an innovative knowledge about CHPT, which would be helpful for its modern research.
基金National Natural Science Foundation of China,No.41525004,No.41421001。
文摘The objective,connotations and research issues of big geodata mining were discussed to address its significance to geographical research in this paper.Big geodata may be categorized into two domains:big earth observation data and big human behavior data.A description of big geodata includes,in addition to the“5Vs”(volume,velocity,value,variety and veracity),a further five features,that is,granularity,scope,density,skewness and precision.Based on this approach,the essence of mining big geodata includes four aspects.First,flow space,where flow replaces points in traditional space,will become the new presentation form for big human behavior data.Second,the objectives for mining big geodata are the spatial patterns and the spatial relationships.Third,the spatiotemporal distributions of big geodata can be viewed as overlays of multiple geographic patterns and the characteristics of the data,namely heterogeneity and homogeneity,may change with scale.Fourth,data mining can be seen as a tool for discovery of geographic patterns and the patterns revealed may be attributed to human-land relationships.The big geodata mining methods may be categorized into two types in view of the mining objective,i.e.,classification mining and relationship mining.Future research will be faced by a number of issues,including the aggregation and connection of big geodata,the effective evaluation of the mining results and the challenge for mining to reveal“non-trivial”knowledge.