To achieve high-quality economic development, it is imperative to prioritize the real economy and foster new factors for economic growth. Data, as a new factor of production, plays a pivotal role in facilitating the seamless integration between digital technology and the real economy. It possesses inherent attributes and techno-economic characteristics that enable the extraction of value across various processes, including production, transaction, consumption, and regulatory supervision. The integration with digital technology enhances the productivity and efficiency of the real economy by facilitating service sector digitalization, accelerating the growth of the new real economy, and supporting the virtual economy in its role of serving the real economy. At present, unleashing the value of data is hindered by inadequate fundamental data systems, a lack of activity in the transaction market, and the underutilization of data as a factor of production by enterprises in the real economy. Therefore, it is advised that data be fully utilized to develop the real economy through the four-pronged approach of “enhancing support for the high-quality provision of data, expediting the integration of data into the real economy, promoting the high-quality development of the real economy, and enhancing public service and governance systems”.
To improve question answering (QA) performance on real-world web data sets, a new set of question classes and a general answer re-ranking model are defined. Using a pre-defined dictionary and grammatical analysis, the question classifier draws both semantic and grammatical information into information retrieval and machine learning methods in the form of various training features, including the question word, the main verb of the question, the dependency structure, the position of the main auxiliary verb, the main noun of the question, the top hypernym of the main noun, etc. The QA query results are then re-ranked by question class information. Experiments show that the questions in real-world web data sets can be accurately classified by the classifier, and that the QA results after re-ranking are noticeably improved. This demonstrates that, with both semantic and grammatical information, applications such as QA built upon real-world web data sets can achieve better performance.
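To make the feature design above concrete, here is a minimal sketch (not the authors' implementation, which relies on a full dependency parse) of extracting a couple of the lexical features mentioned, such as the question word and the position of the auxiliary verb; the word lists and function names are illustrative only.

```python
# A minimal sketch of lexical feature extraction for question classification.
# A real system would use a dependency parser for the structural features.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}
AUX_VERBS = {"is", "are", "was", "were", "do", "does", "did", "can", "could"}

def extract_features(question: str) -> dict:
    tokens = question.lower().rstrip("?").split()
    qword = next((t for t in tokens if t in QUESTION_WORDS), None)
    aux_pos = next((i for i, t in enumerate(tokens) if t in AUX_VERBS), -1)
    return {
        "question_word": qword,        # e.g. "who"
        "aux_verb_position": aux_pos,  # index of the first auxiliary, -1 if none
        "length": len(tokens),
    }

print(extract_features("Who wrote the first web browser?"))
```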
For satellite remote sensing data obtained in the visible and infrared bands, cloud coverage in the sky over the ocean often results in large-scale missing data in inversion products, and thin clouds that are difficult to detect can make the inversion product data abnormal. Alvera et al. (2005) proposed a method for reconstructing missing data based on an Empirical Orthogonal Functions (EOF) decomposition, but their method could not process images with extreme cloud coverage (more than 95%), required a long time for reconstruction, and was strongly affected by abnormal data in the images. This paper therefore improves on that work by reconstructing the missing data sets through two applications of the EOF decomposition. Firstly, the abnormal times are detected by analyzing the temporal modes of the EOF decomposition, and the abnormal data are eliminated. Secondly, the data sets, excluding the abnormal data, are analyzed using EOF decomposition, and the temporal modes are then filtered so as to enhance the ability to reconstruct images with little or no data. Finally, this method is applied to a large data set, i.e., 43 Sea Surface Temperature (SST) satellite images of the Changjiang River (Yangtze River) estuary and its adjacent areas, and the total reconstruction root mean square error (RMSE) is 0.82℃. The results show that this improved EOF reconstruction method is robust for reconstructing missing and unreliable satellite data.
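As an illustration of the core idea, the following is a simplified, DINEOF-style sketch of EOF-based gap filling: missing values are initialized with the mean and repeatedly replaced by a truncated-SVD (EOF) reconstruction. It is a generic sketch, not the authors' two-pass procedure with temporal-mode filtering.

```python
import numpy as np

def eof_fill(data, n_modes=3, n_iter=50):
    """Fill missing values (NaN) in a space-time matrix by iterating a
    truncated-SVD (EOF) reconstruction -- a simplified DINEOF-style sketch."""
    X = data.copy()
    mask = np.isnan(X)
    X[mask] = np.nanmean(data)            # initial guess: global mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        recon = (U[:, :n_modes] * s[:n_modes]) @ Vt[:n_modes]
        X[mask] = recon[mask]             # update only the missing entries
    return X

# toy example: a rank-2 field with roughly 30% of values removed
rng = np.random.default_rng(0)
truth = np.outer(rng.normal(size=40), rng.normal(size=25)) \
      + np.outer(rng.normal(size=40), rng.normal(size=25))
obs = truth.copy()
obs[rng.random(truth.shape) < 0.3] = np.nan
filled = eof_fill(obs, n_modes=2)
rmse = np.sqrt(np.mean((filled - truth) ** 2))
print(f"reconstruction RMSE: {rmse:.3f}")
```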
In this paper, we consider the problem of evaluating system reliability using statistical data obtained from reliability tests of its elements, in which the lifetimes of elements are described by an exponential distribution. We assume that this lifetime data may be reported imprecisely and that this lack of precision may be described using fuzzy sets. As the direct application of the fuzzy sets methodology leads in this case to very complicated and time-consuming calculations, we propose simple approximations of fuzzy numbers using the shadowed sets introduced by Pedrycz (1998). The proposed methodology may be easily extended to the case of general lifetime probability distributions.
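As a concrete illustration of the shadowed-set approximation, the sketch below picks the threshold alpha that balances the membership mass "elevated" to 1 and "reduced" to 0 against the size of the induced shadow region, in the spirit of Pedrycz (1998); the sampling grid and the triangular fuzzy number are illustrative assumptions.

```python
import numpy as np

def shadowed_threshold(mu, alphas=np.linspace(0.01, 0.49, 97)):
    """Pick the threshold alpha for a shadowed-set approximation (after
    Pedrycz, 1998): balance the membership 'elevated' to 1 and 'reduced'
    to 0 against the size of the induced shadow region."""
    best, best_alpha = np.inf, None
    for a in alphas:
        reduced = mu[mu <= a].sum()                 # membership discarded
        elevated = (1.0 - mu[mu >= 1.0 - a]).sum()  # membership added
        shadow = np.sum((mu > a) & (mu < 1.0 - a))  # size of shadow zone
        v = abs(reduced + elevated - shadow)
        if v < best:
            best, best_alpha = v, a
    return best_alpha

# triangular fuzzy number on [0, 2] with peak at 1, sampled on a grid
x = np.linspace(0.0, 2.0, 201)
mu = np.maximum(0.0, 1.0 - np.abs(x - 1.0))
alpha = shadowed_threshold(mu)
print(f"optimal alpha ~ {alpha:.2f}")  # ~0.41 for a triangular fuzzy number
```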
Recently, much interest has been given to multi-granulation rough sets (MGRS), and various types of MGRS models have been developed from different viewpoints. In this paper, we introduce two techniques for the classification of MGRS. Firstly, we generate multi-topologies from multi-relations defined in the universe. Hence, a novel approximation space is established by leveraging the underlying topological structure. The characteristics of the newly proposed approximation space are discussed. We introduce an algorithm for the reduction of multi-relations. Secondly, a new approach for the classification of MGRS based on neighborhood concepts is introduced. Finally, a real-life application from medical records is introduced via our approach to the classification of MGRS.
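For readers unfamiliar with the underlying machinery, the toy sketch below shows the neighborhood-based lower and upper approximations that such classifications build on; the universe, neighborhoods, and target set are made up for illustration and are not the paper's construction.

```python
# Neighborhood-based lower/upper approximations of a set A: the building
# blocks behind neighborhood rough set classification (toy example).
U_set = range(8)
# a toy relation given as neighborhoods N(x) for each element x
N = {0: {0, 1}, 1: {1}, 2: {2, 3}, 3: {3}, 4: {4, 5}, 5: {5},
     6: {6, 7}, 7: {7}}
A = {1, 2, 3, 5}

lower = {x for x in U_set if N[x] <= A}   # N(x) contained in A
upper = {x for x in U_set if N[x] & A}    # N(x) intersects A
print("lower:", sorted(lower))            # {1, 2, 3, 5}
print("upper:", sorted(upper))            # {0, 1, 2, 3, 4, 5}
```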
A novel binary particle swarm optimization for frequent itemset mining from high-dimensional datasets (BPSO-HD) is proposed, incorporating two improvements. Firstly, dimensionality reduction of the initial particles is designed to ensure reasonable initial fitness; then, dynamic dimensionality cutting of the dataset is applied to shrink the search space. On four high-dimensional datasets, BPSO-HD was compared with Apriori to test its reliability, and with ordinary BPSO and quantum swarm evolutionary (QSE) algorithms to demonstrate its advantages. The experiments show that the results given by BPSO-HD are reliable and better than those generated by BPSO and QSE.
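A minimal sketch of the underlying binary PSO update (the classical sigmoid rule, without the paper's dimensionality-reduction and dimensionality-cutting improvements) applied to a toy itemset-mining fitness; the transaction matrix and fitness definition are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# toy transaction database: rows = transactions, columns = items
T = rng.random((100, 12)) < 0.4

def fitness(bits):
    """Toy fitness: support of the itemset encoded by the bit vector,
    weighted by itemset length to reward longer frequent itemsets."""
    if bits.sum() == 0:
        return 0.0
    support = np.mean(np.all(T[:, bits.astype(bool)], axis=1))
    return support * bits.sum()

def binary_pso(n_particles=30, dim=12, iters=100, w=0.7, c1=1.5, c2=1.5):
    X = (rng.random((n_particles, dim)) < 0.5).astype(float)
    V = rng.normal(0, 1, (n_particles, dim))
    pbest, pval = X.copy(), np.array([fitness(x) for x in X])
    g = pbest[pval.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (g - X)
        # classical binary PSO: bit is 1 with probability sigmoid(velocity)
        X = (rng.random(X.shape) < 1.0 / (1.0 + np.exp(-V))).astype(float)
        vals = np.array([fitness(x) for x in X])
        improved = vals > pval
        pbest[improved], pval[improved] = X[improved], vals[improved]
        g = pbest[pval.argmax()].copy()
    return g, pval.max()

best, score = binary_pso()
print("best itemset bits:", best.astype(int), "fitness:", round(score, 3))
```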
The main goal of this research is to assess the impact of race, age at diagnosis, sex, and phenotype on the incidence and survivability of acute lymphocytic leukemia (ALL) among patients in the United States. By taking these factors into account, the study aims to explore how existing cancer registry data can aid in the early detection and effective treatment of ALL. Our hypothesis was that statistically significant correlations exist between the race, age at diagnosis, sex, and phenotype of ALL patients and their rates of incidence and survival. Data were evaluated using the SEER*Stat statistical software from the National Cancer Institute. Analysis of the incidence data revealed a higher prevalence of ALL among the Caucasian population. The majority of ALL cases (59%) occurred in patients aged 0 to 19 years at the time of diagnosis, and 56% of the affected individuals were male. The B-cell phenotype was predominantly associated with ALL cases (73%). When analyzing survivability data, it was observed that the 5-year survival rates slightly exceeded the 10-year survival rates for the respective demographics. Survivability rates of African American patients were the lowest compared with Caucasians, Asians, Pacific Islanders, Alaskan Natives, Native Americans, and others, and survivability rates progressively decreased for older patients. Moreover, this study investigated the typical treatment methods applied to ALL patients, mainly chemotherapy, with occasional supplementation of radiation therapy as required. The study demonstrated the considerable efficacy of chemotherapy in enhancing patients' chances of survival, while those who remained untreated faced a less favorable prognosis. Although a significant amount of data and information already exists, this study can help doctors in the future by aiding the diagnosis of patients with certain characteristics. It will further assist health care professionals in screening potential patients and detecting cases early, which could also save the lives of elderly patients, who have a higher mortality rate from this disease.
Data mining (also known as Knowledge Discovery in Databases, KDD) is defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. The aim of data mining is to discover knowledge of interest to user needs, and it is a genuinely useful tool in many domains such as marketing and decision making. However, some basic issues of data mining tend to be ignored: What is data mining? What is the product of a data mining process? What are we doing in a data mining process? Is there any rule we should obey in a data mining process? In order to discover patterns and knowledge that are really interesting and actionable in the real world, Zhang et al proposed a domain-driven, human-machine-cooperated data mining process, and Zhao and Yao proposed an interactive user-driven classification method using the granule network. In our work, we find that data mining is a kind of knowledge-transforming process that transforms knowledge from a data format into a symbol format. Thus, no new knowledge can be generated in a data mining process; knowledge is merely transformed from the data format, which is not understandable to humans, into the symbol format, which is understandable to humans and easy to use. This is similar to translating a book from Chinese into English: the knowledge in the book should remain unchanged, and only its format changes. The knowledge in the English book should be the same as in the Chinese one; otherwise, there must be mistakes in the translation. That is, in a data mining process we are transforming knowledge from one format into another, not producing new knowledge. The knowledge is originally stored in the data (data is one representation format of knowledge); unfortunately, we cannot read, understand, or use it, since we cannot understand raw data. With this understanding of data mining, we propose a data-driven knowledge acquisition method based on rough sets, which also improves the performance of classical knowledge acquisition methods. In fact, we also find that domain-driven data mining and user-driven data mining do not conflict with our data-driven data mining; they can be integrated into domain-oriented, data-driven data mining. This is analogous to views in a database: users with different views see different parts of the data, so users with different tasks or objectives may wish to, and can, discover different (partial) knowledge from the same database. However, all such partial knowledge must already exist in the database. A domain-oriented, data-driven data mining method would therefore help us extract the knowledge that really exists in a database and that is really interesting and actionable in the real world.
In the transition of China’s economy from high-speed growth to high-quality growth in the new era, economic practices are oriented to fostering new growth drivers, developing new industries, and forming new models. Based on the data flow, big data effectively integrates technology, material, fund, and human resource flows and reveals new paths for the development of new growth drivers, new industries, and new models. Adopting an analytical framework with “macro-meso-micro” levels, this paper elaborates on the theoretical mechanisms by which big data drives high-quality growth through efficiency improvements, upgrades of industrial structures, and business model innovations. It also explores the practical foundations for big data driven high-quality growth, including technological advancements of big data, the development of big data industries, and the formulation of big data strategies. Finally, this paper proposes policy options for big data promoting high-quality growth in terms of developing the digital economy, consolidating the infrastructure construction of big data, expediting the convergence of big data and the real economy, advocating for a big data culture, and expanding financing options for big data.
A rough set probabilistic data association (RS-PDA) algorithm is proposed to reduce the complexity and time consumption of data association and to enhance the accuracy of tracking results in multi-target tracking applications. In the new algorithm, measurements lying in the intersection of two or more validation regions are allocated to the corresponding targets through rough set theory, and the multi-target tracking problem is reduced to single-target tracking once the measurements in the intersection regions have been classified. Several typical multi-target tracking applications are given. The simulation results show that the algorithm can not only reduce complexity and time consumption but also enhance the accuracy and stability of the tracking results.
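The sketch below illustrates the gating problem the algorithm addresses: a chi-square validation gate and a simple nearest-target rule for a measurement falling in the intersection of two gates. The nearest-target rule is only a stand-in for the paper's rough-set classification, and all numbers are illustrative.

```python
import numpy as np

def validation_gate(z, z_pred, S, gamma=9.21):
    """Chi-square gate: measurement z is valid for a target if its
    Mahalanobis distance to the predicted measurement is below gamma
    (9.21 is roughly the 99% gate for 2-D measurements)."""
    d = z - z_pred
    return float(d @ np.linalg.inv(S) @ d) <= gamma

# two targets with overlapping gates; a nearest-target rule plays the role
# that the rough-set classification plays in the paper (a stand-in only)
preds = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
S = np.eye(2)                       # innovation covariance (illustrative)
z = np.array([1.1, 0.2])            # lies in the intersection of both gates

in_gate = [validation_gate(z, p, S) for p in preds]
if all(in_gate):
    dists = [float((z - p) @ np.linalg.inv(S) @ (z - p)) for p in preds]
    owner = int(np.argmin(dists))
    print(f"measurement in both gates, assigned to target {owner}")
```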
An attempt has been made to apply a novel genetic programming (GP) technique, a new member of the family of evolutionary algorithms, to predict the water storage of the Wolonghu wetland in northeastern China in response to climate change with a small data set. Fourteen years (1993-2006) of annual water storage and climatic data for the wetland were used for model training and testing. The simulations and predictions showed a good fit between calculated water storage and observed values (MAPE = 9.47, r = 0.99). For comparison, a multilayer perceptron (MLP, a popular artificial neural network model) and a grey model (GM) were applied to the same data set for performance estimation. GP had better performance than the other two methods in both the simulation and prediction phases, and the results are analyzed and discussed. The case study confirms that GP is a promising way for wetland managers to make quick estimates of fluctuations in water storage in wetlands where little data is available.
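For reference, the two reported fit statistics (MAPE and Pearson's r) can be computed as follows; the numbers in the example are illustrative, not the wetland data.

```python
import numpy as np

def mape(observed, predicted):
    """Mean absolute percentage error, in percent."""
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return 100.0 * np.mean(np.abs((observed - predicted) / observed))

def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted."""
    return np.corrcoef(observed, predicted)[0, 1]

# illustrative numbers only -- not the Wolonghu wetland data
obs = np.array([12.1, 10.4, 11.8, 13.0, 9.7])
pred = np.array([11.5, 10.9, 12.6, 12.2, 10.3])
print(f"MAPE = {mape(obs, pred):.2f}%, r = {pearson_r(obs, pred):.2f}")
```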
In gene prediction, Fisher discriminant analysis (FDA) is used to separate protein coding regions (exons) from non-coding regions (introns). Usually, the positive and negative data sets are of the same size if enough data is available; but when the data are insufficient or unbalanced, the threshold used in FDA can have an important influence on prediction results. This paper presents a study on the selection of that threshold. The eigenvalue of each exon/intron sequence is computed using the Z-curve method with 69 variables. The experimental results suggest that the sizes and standard deviations of the data sets and the threshold are the three key elements to take into consideration to improve prediction results.
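A small sketch of the setting: computing the Fisher discriminant direction on imbalanced toy data and sweeping the threshold on the projected scores rather than using the default midpoint; the data and dimensionality are illustrative (the study itself uses 69 Z-curve variables).

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """Fisher discriminant direction w = Sw^{-1} (mu_pos - mu_neg)."""
    m1, m0 = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)
    return np.linalg.solve(Sw, m1 - m0)

rng = np.random.default_rng(2)
# imbalanced toy data standing in for exon (positive) / intron (negative)
X_pos = rng.normal(1.0, 1.0, (200, 5))
X_neg = rng.normal(0.0, 1.0, (800, 5))

w = fisher_direction(X_pos, X_neg)
s_pos, s_neg = X_pos @ w, X_neg @ w

# the midpoint of projected means is the usual default threshold; with
# unequal set sizes it pays to sweep the threshold for best accuracy
best_t, best_acc = None, 0.0
for t in np.linspace(s_neg.mean(), s_pos.mean(), 101):
    correct = np.sum(s_pos > t) + np.sum(s_neg <= t)
    acc = correct / (len(s_pos) + len(s_neg))
    if acc > best_acc:
        best_t, best_acc = t, acc
print(f"best threshold {best_t:.3f}, accuracy {best_acc:.3f}")
```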
The Arctic region is experiencing strong warming and related changes in the state of sea ice, permafrost, tundra, the marine environment, and terrestrial ecosystems. These changes are found in any climatological data set comprising the Arctic region. This study compares the temperature trends in several surface, satellite, and reanalysis data sets. We demonstrate large differences in the 1979-2002 temperature trends: the data sets disagree on the magnitude of the trends as well as on their seasonal, zonal, and vertical patterns. It was found that the surface temperature trends are stronger than the trends in tropospheric temperature for each latitude band north of 50°N in every month except those of the ice-melting season. These results emphasize that conclusions of climate studies drawn on the basis of a single data set should be treated with caution, as they may be affected by artificial biases in the data.
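As a minimal illustration of the kind of trend estimate being compared, a least-squares linear trend in degrees per decade can be computed as below; the series is synthetic, not one of the study's data sets.

```python
import numpy as np

def trend_per_decade(years, temps):
    """Least-squares linear trend, expressed in degrees C per decade."""
    slope, _ = np.polyfit(years, temps, 1)
    return 10.0 * slope

# synthetic anomaly series for 1979-2002 (illustrative, not a real data set)
rng = np.random.default_rng(3)
years = np.arange(1979, 2003)
temps = 0.05 * (years - 1979) + rng.normal(0, 0.3, years.size)
print(f"trend: {trend_per_decade(years, temps):+.2f} degC/decade")
```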
Industry is the main component of the real economy, the main body of scientific and technological innovation, and an essential support for improving the quality and efficiency of the regional economy. In recent years, Guizhou's industry has achieved gains in efficiency, motivation, and structural change by relying on three strategic actions, big data, poverty alleviation, and big ecology, which have promoted the development of Guizhou's industrial economy. Guided by the ideas and concepts of high-quality development, this paper constructs an evaluation index system and evaluation model for the high-quality development of the industrial economy and evaluates the high-quality development of the industrial economy in Guizhou Province from both vertical and horizontal perspectives. Vertically, Guizhou has shown a good development trend in recent years under the big data, big ecology, and poverty alleviation strategies; horizontally, among the four provinces and cities in southwestern China, Guizhou Province has the lowest average score over the past five years, and it is urgent to improve its talent policy and promote the upgrading of its industrial structure. Finally, combined with relevant policies, countermeasures and suggestions are put forward to address the weak points of industrial development in Guizhou and to promote the high-quality development of Guizhou's industry.
Data reconstruction is a crucial step in seismic data preprocessing. To improve reconstruction speed and save memory, the commonly used three-dimensional (3D) seismic data reconstruction method divides the missing data into a series of time slices and independently reconstructs each time slice. However, when this strategy is employed, the potential correlations between two adjacent time slices are ignored, which degrades reconstruction performance. Therefore, this study proposes the use of a two-dimensional curvelet transform and the fast iterative shrinkage-thresholding algorithm for data reconstruction. Based on the significant overlap between the curvelet coefficient support sets of two adjacent time slices, a weighted operator is constructed in the curvelet domain using the prior support set provided by the previously reconstructed time slice to delineate the main energy distribution range, effectively providing prior information for reconstructing adjacent slices. Consequently, the resulting weighted fast iterative shrinkage-thresholding algorithm can be used to reconstruct 3D seismic data. The processing of synthetic and field data shows that the proposed method has higher reconstruction accuracy and faster computational speed than the conventional fast iterative shrinkage-thresholding algorithm for handling missing 3D seismic data.
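The sketch below shows the FISTA iteration with a (possibly weighted) soft threshold on transform coefficients, which is the engine of the method; an orthonormal DCT stands in for the 2D curvelet transform and a 1D masked signal stands in for a time slice, so this is an assumption-laden illustration rather than the paper's implementation.

```python
import numpy as np
from scipy.fft import dct, idct

def soft(x, t):
    """Soft-thresholding (shrinkage) operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def weighted_fista(b, mask, lam=0.01, weights=None, iters=200):
    """FISTA for mask-based signal reconstruction with a weighted soft
    threshold on orthonormal DCT coefficients. The DCT stands in for the
    curvelet transform; the weights mirror the idea of shrinking less
    where prior support information says the energy should be."""
    w = np.ones_like(b) if weights is None else weights
    c = dct(b, norm="ortho")              # initial coefficients
    y, t = c.copy(), 1.0
    for _ in range(iters):
        grad = dct(mask * idct(y, norm="ortho") - b, norm="ortho")
        c_new = soft(y - grad, lam * w)   # step size 1 (orthonormal DCT)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = c_new + ((t - 1.0) / t_new) * (c_new - c)
        c, t = c_new, t_new
    return idct(c, norm="ortho")

# toy test: a DCT-sparse signal with half of its samples missing
rng = np.random.default_rng(4)
n = 256
coef = np.zeros(n)
coef[[3, 17, 40]] = [1.0, -0.7, 0.5]
x_true = idct(coef, norm="ortho")
mask = (rng.random(n) < 0.5).astype(float)
b = mask * x_true
x_rec = weighted_fista(b, mask)
print("relative error:",
      np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true))
```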
Multi-view clustering is a critical research area in computer science aimed at effectively extracting meaningful patterns from complex, high-dimensional data that single-view methods cannot capture. Traditional fuzzy clustering techniques, such as Fuzzy C-Means (FCM), face significant challenges in handling uncertainty and the dependencies between different views. To overcome these limitations, we introduce a new multi-view fuzzy clustering approach that integrates picture fuzzy sets with a dual-anchor graph method for multi-view data, aiming to enhance clustering accuracy and robustness, termed Multi-view Picture Fuzzy Clustering (MPFC). In particular, picture fuzzy set theory extends the capability to represent uncertainty by modeling three membership levels: membership degrees, neutral degrees, and refusal degrees. This allows for a more flexible representation of uncertain and conflicting data than traditional fuzzy models. Meanwhile, dual-anchor graphs exploit the similarity relationships between data points and integrate information across views. This combination improves stability, scalability, and robustness when handling noisy and heterogeneous data. Experimental results on several benchmark datasets demonstrate significant improvements in clustering accuracy and efficiency, outperforming traditional methods. Specifically, the MPFC algorithm demonstrates outstanding clustering performance on a variety of datasets, attaining a Purity (PUR) score of 0.6440 and an Accuracy (ACC) score of 0.6213 on the 3 Sources dataset, underscoring its robustness and efficiency. The proposed approach significantly contributes to fields such as pattern recognition, multi-view relational data analysis, and large-scale clustering problems. Future work will focus on extending the method to semi-supervised multi-view clustering, aiming to enhance adaptability, scalability, and performance in real-world applications.
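For context, the classical single-view FCM baseline that MPFC generalizes can be sketched in a few lines; this is the standard algorithm, not the proposed MPFC.

```python
import numpy as np

def fcm(X, k, m=2.0, iters=100, seed=0):
    """Classical single-view fuzzy C-means -- the baseline that MPFC
    extends with picture fuzzy sets and dual-anchor graphs."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)        # fuzzy memberships sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        # u_ij proportional to d_ij^{-2/(m-1)}, normalized over clusters
        U = 1.0 / (d ** (2 / (m - 1)) * np.sum(d ** (-2 / (m - 1)),
                                               axis=1, keepdims=True))
    return U, centers

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
U, C = fcm(X, k=2)
print("cluster sizes:", np.bincount(U.argmax(axis=1)))
```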
This article addresses the problem of poor (incomplete) databases, which are very common in practice. A database intended to support decision making must be comprehensive and effectively collected; when the available database is a poor one, it hampers the application of decision support. To address this problem, the author proposes a solution based on rough set theory that can scientifically, correctly, and effectively supplement a poor database, and can greatly help strengthen applications of data and artificial intelligence.