Funding: Project (RDF 11-02-03) supported by the Research Development Fund of XJTLU, China.
Abstract: Information analysis of high-dimensional data was carried out by applying a similarity measure. High-dimensional data were considered as a typical structure. Additionally, overlapped and non-overlapped data were introduced, and the similarity measure analysis was illustrated and compared with a conventional similarity measure. As a result, for overlapped data the similarity could be expressed with the conventional similarity measure. The similarity analysis of non-overlapped data provided a clue for computing the similarity of high-dimensional data. The high-dimensional data analysis was designed with consideration of neighborhood information. Conservative and strict solutions were proposed. The proposed similarity measure was applied to express financial fraud among multi-dimensional datasets. In an illustrative example, financial fraud similarity with respect to age, gender, qualification, and job was presented. With the proposed similarity measure, high-dimensional personal data were evaluated for how similar they are to financial fraud. The calculation results show that actual fraud has a rather high similarity measure compared to the average, ranging from a minimum of 0.0609 to a maximum of 0.1667.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 12101335 and 12271271), the Natural Science Foundation of Tianjin (Grant No. 21JCQNJC00020), and the Fundamental Research Funds for the Central Universities, Nankai University (Grant Nos. 63211088 and 63221050); by the National Natural Science Foundation of China (Grant No. 12101332); and by Shenzhen Wukong Investment Company, the Fundamental Research Funds for the Central Universities (Grant No. ZB22000105), the China National Key R&D Program (Grant Nos. 2019YFC1908502, 2022YFA1003703, 2022YFA1003802, 2022YFA1003803), and the National Natural Science Foundation of China (Grant Nos. 12271271, 11925106, 12231011, 11931001 and 11971247).
Abstract: This paper establishes the asymptotic independence between the quadratic form $z^{T}Az$ and the maximum $\max_{1\le i\le p}|z_i|$ of a sequence of independent sub-Gaussian random variables $z=(z_1,\ldots,z_p)^{T}$. Based on this theoretical result, we find the asymptotic joint distribution of the quadratic form and the maximum, which can be applied to high-dimensional testing problems. By combining the sum-type test and the max-type test, we propose Fisher’s combination tests for the one-sample mean test and the two-sample mean test. Under this novel general framework, several strong assumptions in the existing literature are relaxed. Monte Carlo simulations show that the proposed tests are strongly robust to both sparse and dense data.
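As a rough illustration of the combination step only, the sketch below applies the standard Fisher's method to two p-values that are assumed to be independent, as the asymptotic independence result would justify. The function name `fisher_combine` and the arguments `p_sum` and `p_max` are hypothetical labels for the p-values of a sum-type and a max-type test; how those p-values are computed is not specified in the abstract and is not attempted here.

```python
import math


def fisher_combine(p_sum: float, p_max: float) -> float:
    """Combine two independent p-values with Fisher's method.

    Under the null hypothesis and independence, the statistic
    -2 * (log p_sum + log p_max) follows a chi-square distribution
    with 4 degrees of freedom.
    """
    for p in (p_sum, p_max):
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
    stat = -2.0 * (math.log(p_sum) + math.log(p_max))
    half = stat / 2.0
    # Closed-form survival function of chi-square with 4 degrees of freedom:
    # P(X > x) = exp(-x/2) * (1 + x/2).
    return math.exp(-half) * (1.0 + half)


if __name__ == "__main__":
    # Hypothetical p-values from a sum-type test and a max-type test.
    print(fisher_combine(0.05, 0.05))
```

The combined p-value is small when either input is small, which is the usual intuition for why such a combination can remain powerful under both sparse and dense alternatives.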
Abstract: The journal Genomics, Proteomics & Bioinformatics (GPB) invites leading scholars to contribute high-quality manuscripts for a special issue on “AI+BT for Big Clinical Omics Data”, scheduled for publication in the autumn of 2026. This special issue seeks submissions that focus on integrating artificial intelligence (AI) and biotechnologies (BT) to substantially improve the collection, modelling, analysis, and application of large-scale clinical omics data. The goal is to address the challenges posed by the high-dimensional and dynamic nature of big clinical omics data and to explore their potential to advance the diagnosis and treatment of complex diseases.
Funding: Supported by the National Natural Science Foundation of China (Nos. T2341022, 12322119, 62172164, and 12271180), the Guangdong Provincial Key Laboratory of Human Digital Twin (2022B1212010004), the Educational Commission of Guangdong Province of China (2023KQNCX073), the Natural Science Foundation of Guangdong Province of China (2022A1515110759 and 2023A1515110558), and the Fundamental Research Funds for the Central Universities (2023ZYGXZR077).
Abstract: Complex diseases do not always progress gradually. Instead, they may undergo sudden shifts known as critical states or tipping points, where a marked qualitative change occurs. Detecting such a pivotal transition, or pre-deterioration state, is of paramount importance because of its association with severe disease deterioration. Nevertheless, pinpointing the pre-deterioration state of complex diseases remains difficult, especially for high-dimensional data with limited samples, where conventional statistical methods frequently prove inadequate. In this study, we introduce a quantitative approach termed sample-specific causality network entropy (SCNE), which infers a sample-specific causality network for each individual and quantifies the dynamic alterations in causal relations among molecules, thereby capturing critical points or pre-deterioration states of complex diseases. We substantiated the accuracy and efficacy of our approach via numerical simulations and by examining various real-world datasets, including single-cell data of epithelial cell deterioration (EPCD) in colorectal cancer, influenza infection data, and three different tumor cases from The Cancer Genome Atlas (TCGA) repositories. Compared with six existing single-sample methods, the proposed approach exhibits superior performance in identifying critical signals or pre-deterioration states. Additionally, the efficacy of the computational findings is underscored by analyzing the functionality of the signaling biomarkers.
Funding: This work was supported in part by a grant from the Social Sciences and Humanities Research Council of Canada (SSHRC) (895-2011-1011) held by the second author.
Abstract: The large number of environmental problems faced by society in recent years has driven researchers to collect and study massive amounts of data in order to understand the complex relations that exist between people and the environment in which we live. Such datasets are often high-dimensional and heterogeneous in nature, with complex geospatial relations. Analysing such data can be challenging, especially when there is a need to maintain spatial awareness as the non-spatial attributes are studied. Geo-Coordinated Parallel Coordinates (GCPC) is a geovisual analytics approach designed to support exploration and analysis within complex geospatial environmental data. Parallel coordinates are tightly coupled with a geospatial representation and an investigative scatterplot, all of which can be used to show, reorganize, filter, and highlight the high-dimensional, heterogeneous, and geospatial aspects of the data. Two sets of field trials were conducted with expert data analysts to validate the real-world benefits of the approach for studying environmental data. The results of these evaluations were positive, providing real-world evidence and new insights regarding the value of using GCPC to explore environmental datasets when there is a need to remain aware of the geospatial aspects of the data as the non-spatial elements are studied.
Abstract: By skeptics and undecided we refer to nodes in clustered social networks that cannot easily be assigned to any of the clusters. Such nodes are typically found either at the interface between clusters (the undecided) or at their boundaries (the skeptics). Identifying these nodes is relevant in marketing applications such as voter targeting, because the persons represented by such nodes are often more likely to be affected by marketing campaigns than those represented by nodes deep within clusters. So far, this identification task has not been studied as well as other network analysis tasks such as clustering, identifying central nodes, and detecting motifs. We approach this task by deriving novel geometric features from the network structure that naturally lend themselves to an interactive visual approach for identifying interface and boundary nodes.
Funding: This work was supported by the National Basic Research Program of China (No. 2007CB307100) and the National Natural Science Foundation of China (Grant No. 60432010).
Abstract: Many recently proposed subspace clustering methods suffer from two severe problems. First, the algorithms typically scale exponentially with the data dimensionality or the subspace dimensionality of the clusters. Second, the clustering results are often sensitive to input parameters. In this paper, a fast subspace clustering algorithm based on attribute clustering is proposed to overcome these limitations. The algorithm first filters out redundant attributes by computing the Gini coefficient. To evaluate the correlation of every pair of non-redundant attributes, the relation matrix of non-redundant attributes is constructed based on the relation function of two-dimensional united Gini coefficients. After applying an overlapping clustering algorithm to the relation matrix, the candidate set of all interesting subspaces is obtained. Finally, all subspace clusters can be derived by clustering on the interesting subspaces. Experiments on both synthetic and real datasets show that the new algorithm not only achieves a significant gain in runtime and quality when finding subspace clusters, but is also insensitive to input parameters.