Funding: Supported by the National Natural Science Foundation of China (60675039), the National High Technology Research and Development Program of China (863 Program) (2006AA04Z217), and the Hundred Talents Program of the Chinese Academy of Sciences.
Funding: Projects (60903082, 60975042) supported by the National Natural Science Foundation of China; Project (20070217043) supported by the Research Fund for the Doctoral Program of Higher Education of China.
Abstract: Many classical clustering algorithms perform well under their own assumptions but do not scale when applied to very large data sets (VLDS). In this work, a novel division and partition clustering method (DP) is proposed to solve this problem. DP cuts the source data set into data blocks and extracts an eigenvector for each block to form a local feature set. The local feature set is then used in a second round of characteristic aggregation over the source data to obtain the global eigenvectors. Finally, according to the global eigenvectors, the data points are assigned by the minimum-distance criterion. Experimental results show that DP is more robust than conventional clustering methods. Its insensitivity to data dimensionality, data distribution, and the number of natural clusters gives it a wide range of applications in clustering VLDS.
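The division-and-partition scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: plain k-means stands in for the local eigenvector extraction and for the second-round aggregation, and the block count and function names are invented for the example.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means; returns the cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest center, then update the centers
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def dp_cluster(X, k, n_blocks=4):
    """Division and partition: extract local representatives per block,
    aggregate them in a second round, then assign by minimum distance."""
    blocks = np.array_split(X, n_blocks)               # cut into data blocks
    local = np.vstack([kmeans(b, k) for b in blocks])  # local feature set
    global_centers = kmeans(local, k)                  # global representatives
    labels = np.argmin(((X[:, None, :] - global_centers) ** 2).sum(axis=2),
                       axis=1)                         # minimum-distance assignment
    return global_centers, labels
```

Because only the pooled local representatives enter the second round, each block can be processed independently (and in parallel), which is what makes a scheme of this shape attractive for VLDS.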
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 41541010, 41701456, 41421001, 41590840 & 91425304), the Key Programs of the Chinese Academy of Sciences (Grant No. QYZDY-SSW-DQC007), and the Cultivate Project of the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences (Grant No. TSYJS03).
Abstract: Surface modeling with very large data sets is challenging. An efficient method for modeling massive data sets using the high accuracy surface modeling method (HASM) is proposed, and HASM_Big is developed to handle very large data sets. A large data set is defined here as a large spatial domain at high resolution, leading to a linear equation with matrix dimensions of hundreds of thousands. An augmented system approach is employed to solve the equality-constrained least squares problem (LSE) produced in HASM_Big, and a block row action method is applied to solve the corresponding very large matrix equations. A matrix partitioning method is used to avoid information redundancy among blocks and thereby accelerate the model. Experiments including numerical tests and real-world applications are used to compare the performance of HASM_Big with its previous version, HASM. Results show that the memory footprint and computing speed of HASM_Big are better than those of HASM, and that the computational cost of HASM_Big scales linearly, even with massive data sets. In conclusion, HASM_Big provides a powerful tool for surface modeling, especially when there are millions or more computing grid cells.
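The equality-constrained least squares step can be illustrated at toy scale. A hedged sketch, not the HASM_Big solver itself: it forms the dense KKT (augmented) system directly and solves it with a general linear solver, whereas HASM_Big uses a block row action method precisely because direct assembly is infeasible at hundreds of thousands of unknowns; the function name is invented.

```python
import numpy as np

def lse_kkt(A, b, C, d):
    """Solve min ||A x - b||_2 subject to C x = d via the KKT system
        [[A^T A, C^T],
         [C,     0  ]] [x; lam] = [A^T b; d],
    where lam is the vector of Lagrange multipliers."""
    m, n = C.shape[0], A.shape[1]
    K = np.block([[A.T @ A, C.T],
                  [C, np.zeros((m, m))]])
    rhs = np.concatenate([A.T @ b, d])
    sol = np.linalg.solve(K, rhs)
    return sol[:n]  # drop the multipliers, keep the solution x
```

The returned x satisfies the constraint exactly and minimizes the residual over the constraint set; block row action methods reach the same solution by sweeping over row blocks of this system without ever assembling it.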
Abstract: A co-location pattern is a set of spatial features whose instances frequently appear in a spatial neighborhood. This paper efficiently mines the top-k probabilistic prevalent co-locations over spatially uncertain data sets and makes the following contributions: 1) the concept of top-k probabilistic prevalent co-locations based on a possible-world model is defined; 2) a framework for discovering the top-k probabilistic prevalent co-locations is set up; 3) a matrix method is proposed to improve the computation of the prevalence probability of a top-k candidate, and two pruning rules for the matrix blocks are given to accelerate the search for exact solutions; 4) a polynomial matrix is developed to further speed up the top-k candidate refinement process; 5) an approximate algorithm with a compensation factor is introduced so that relatively large quantities of data can be processed quickly. The efficiency of the proposed algorithms, as well as the accuracy of the approximation algorithms, is evaluated with an extensive set of experiments using both synthetic and real uncertain data sets.
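For intuition, the prevalence probability of contribution 1) can be computed by brute force on toy input by enumerating possible worlds, which is exactly the cost the paper's matrix method and pruning rules are designed to avoid. A hedged sketch under assumptions: participation ratios are computed against the instances present in each world, and the data representation and function name are invented for the example.

```python
from itertools import product

def prevalence_probability(instances, neighbors, min_prev=0.5):
    """Brute-force prevalence probability of a co-location pattern under
    the possible-world model.
    instances: list of (feature, id, existence_probability)
    neighbors: set of frozensets of ids that are spatial neighbors.
    Only feasible for toy inputs -- all 2^n worlds are enumerated."""
    features = {f for f, _, _ in instances}
    total = 0.0
    for world in product([0, 1], repeat=len(instances)):
        p = 1.0
        present = []
        for bit, (feat, iid, ep) in zip(world, instances):
            p *= ep if bit else 1.0 - ep
            if bit:
                present.append((feat, iid))
        if p == 0.0:
            continue
        # pattern is prevalent in this world if every feature's
        # participation ratio reaches min_prev
        prevalent = True
        for feat in features:
            own = [i for f, i in present if f == feat]
            others = [i for f, i in present if f != feat]
            part = [i for i in own
                    if any(frozenset((i, j)) in neighbors for j in others)]
            if not own or len(part) / len(own) < min_prev:
                prevalent = False
                break
        if prevalent:
            total += p
    return total
```

With instances [("A", "a1", 1.0), ("B", "b1", 0.6)] and a1, b1 as neighbors, only the world containing b1 (probability 0.6) makes the pattern {A, B} prevalent, so the function returns 0.6. The 2^n enumeration is why pruning and approximation matter at scale.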
Funding: Supported by the National Natural Science Foundation of China-Science and Technology Development Fund (No. 62361166662); the National Key R&D Program of China (Nos. 2023YFC3503400 and 2022YFC3400400); the Key R&D Program of Hunan Province (Nos. 2023GK2004, 2023SK2059, and 2023SK2060); the Top 10 Technical Key Project in Hunan Province (No. 2023GK1010); the Key Technologies R&D Program of Guangdong Province (No. 2023B1111030004 to FFH); the Funds of the State Key Laboratory of Chemo/Biosensing and Chemometrics; the National Supercomputing Center in Changsha (http://nscc.hnu.edu.cn/); Peng Cheng Lab; and the Graduate Research Innovation Project of Hunan Province (No. QL20230101).
Abstract: Drawing parallels between linguistic constructs and cellular biology, Large Language Models (LLMs) have achieved success in diverse downstream applications for single-cell data analysis. To date, however, methods are still lacking that leverage LLMs to infer Ligand-Receptor (LR)-mediated cell-cell communications from spatially resolved transcriptomic data. Here, we propose SpaCCC to facilitate the inference of spatially resolved cell-cell communications; it relies on our fine-tuned single-cell LLM and a functional gene interaction network to embed ligand and receptor genes into a unified latent space. LR pairs that lie significantly closer in the latent space are taken to be more likely to interact. Molecular diffusion and permutation test strategies are then employed to calculate communication strength and to filter out communications with low specificity, respectively. SpaCCC is benchmarked on real single-cell spatial transcriptomic datasets and shows superior performance over other methods. SpaCCC also infers known LR pairs concealed by existing aggregative methods and identifies communication patterns for specific cell types and their signaling pathways. Furthermore, SpaCCC provides various cell-cell communication visualization results at both single-cell and cell-type resolution. In summary, SpaCCC is a sophisticated and practical tool that allows researchers to decipher spatially resolved cell-cell communications, along with the related communication patterns and signaling pathways, from spatial transcriptome data.
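The permutation-test filtering step can be sketched as follows. This is a hedged illustration, not SpaCCC's implementation: the null model (random gene pairs drawn from the same embedding), the Euclidean distance metric, and the function name are all assumptions made for the example.

```python
import numpy as np

def lr_permutation_pvalue(emb, ligand, receptor, n_perm=1000, seed=0):
    """One-sided permutation p-value for whether a ligand-receptor pair
    sits closer in embedding space than random gene pairs.
    emb: dict mapping gene name -> embedding vector (np.ndarray)."""
    rng = np.random.default_rng(seed)
    genes = list(emb)
    observed = np.linalg.norm(emb[ligand] - emb[receptor])
    null = np.empty(n_perm)
    for i in range(n_perm):
        # draw a random pair of distinct genes as the null comparison
        g1, g2 = rng.choice(len(genes), size=2, replace=False)
        null[i] = np.linalg.norm(emb[genes[g1]] - emb[genes[g2]])
    # +1 correction avoids a zero p-value from a finite permutation set
    return float((np.sum(null <= observed) + 1) / (n_perm + 1))
```

A small p-value means the pair is closer than almost all random pairs, so the candidate communication survives the specificity filter; large p-values are discarded.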
Funding: Supported by NSFC No. 62372430 and the Youth Innovation Promotion Association CAS No. 2023112.
Abstract: Intelligent spatial-temporal data analysis, leveraging data such as multivariate time series and geographic information, provides researchers with powerful tools to uncover multiscale patterns and enhance decision-making processes. As artificial intelligence advances, intelligent spatial-temporal algorithms have found extensive applications across various disciplines, such as geosciences, biology, and public health.1 Compared to traditional methods, these algorithms are data driven, making them well suited to the complexities of modeling real-world systems. However, their reliance on substantial domain-specific expertise limits their broader applicability. Recently, significant advances have been made in spatial-temporal large models. Trained on large-scale data, these models exhibit a vast parameter scale, superior generalization capabilities, and multitasking advantages over previous methods. Their high versatility and scalability position them as promising super hubs for multidisciplinary research, integrating knowledge, intelligent algorithms, and research communities from different fields. Nevertheless, achieving this vision will require overcoming numerous critical challenges, offering an expansive and profound space for future exploration.
Abstract: As the operating modes of interconnected power grids become increasingly complex and variable, and as wide area measurement system (WAMS) deployment becomes ever more complete, real-time stability analysis based on WAMS measurement big data has become a necessity. At the same time, how to perform spatio-temporal synchronization and anomalous-data detection on the massive, millisecond-level WAMS data collected from many nodes across the whole grid has become a key problem blocking its fuller use. This paper therefore proposes a modeling and analysis method for WAMS measurement big data based on high-dimensional random matrices. First, building on an analysis of the spatio-temporal characteristics of WAMS measurement data, a high-dimensional random matrix model of the data is constructed according to high-dimensional random matrix theory. The theory and method for anomalous-data detection are then derived. Finally, measured data are emulated in a simulation case study, and by comparing the trace statistics and spectral distributions of the measurement data at different anomaly instants, the effectiveness and applicability of the proposed modeling method are verified.
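A minimal numerical illustration of spectral anomaly detection on a measurement window. This sketch is an assumption-laden stand-in, not the paper's method: it uses the Marchenko-Pastur upper edge of the eigenvalue spectrum as the detection threshold, and the function name and score definition are invented.

```python
import numpy as np

def mp_anomaly_score(X):
    """Score a measurement window against the Marchenko-Pastur law.
    X: (N, T) matrix, N measurement channels x T time samples, N < T.
    Rows are standardized; under pure noise the eigenvalues of
    (1/T) Z Z^T stay near or below the MP upper edge (1 + sqrt(N/T))^2,
    so the ratio of the largest eigenvalue to that edge serves as an
    anomaly score (roughly 1 for noise, much larger under a
    cross-channel disturbance)."""
    N, T = X.shape
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0
    Z = (X - mu) / sd                       # standardize each channel
    eigs = np.linalg.eigvalsh(Z @ Z.T / T)  # ascending eigenvalues
    mp_edge = (1 + np.sqrt(N / T)) ** 2
    return float(eigs[-1] / mp_edge)
```

Under pure noise the score stays near 1; a disturbance correlated across channels pushes the top eigenvalue far beyond the MP edge, which is the kind of signature that trace statistics and spectral-distribution comparisons can reveal.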