Funding: Funded by the National 863 Program of China (No. 2005AA113150) and the National Natural Science Foundation of China (No. 40701158).
Abstract: A novel Hilbert curve is introduced for parallel spatial data partitioning, taking into account both the huge volume of spatial information and the variable length of vector data items. Based on the improved Hilbert curve, an algorithm is designed to achieve almost-uniform spatial data partitioning among multiple disks in parallel spatial databases. Data imbalance is thus largely avoided, and search and query efficiency are enhanced.
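The abstract does not specify the improved curve, but the underlying idea can be sketched: map each feature's 2-D cell coordinates to a 1-D Hilbert key, sort by key, and cut the ordered sequence into near-equal byte-size runs, one per disk. Below is a minimal sketch using the standard Hilbert mapping; the greedy size-balancing heuristic and the `(x, y, size_bytes)` item layout are illustrative assumptions, not the paper's algorithm.

```python
def hilbert_xy_to_d(n, x, y):
    """Map grid cell (x, y) to its distance along a Hilbert curve
    over an n-by-n grid (n a power of two). Standard algorithm."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant into canonical orientation
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

def partition_by_hilbert(items, num_disks, n=1024):
    """items: list of (x, y, size_bytes) tuples for variable-length
    vector records. Greedily split the Hilbert-ordered sequence into
    num_disks runs of roughly equal total byte size (a sketch of the
    almost-uniform partitioning idea, not the paper's exact method)."""
    ordered = sorted(items, key=lambda it: hilbert_xy_to_d(n, it[0], it[1]))
    total = sum(it[2] for it in ordered)
    target = total / num_disks
    disks = [[] for _ in range(num_disks)]
    cur, acc = 0, 0
    for it in ordered:
        disks[cur].append(it)
        acc += it[2]
        if acc >= target and cur < num_disks - 1:
            cur += 1
            acc = 0
    return disks
```

Because the Hilbert curve preserves spatial locality, records placed on the same disk tend to be spatial neighbors, which is what makes range queries against the partitioned database efficient.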
Funding: Supported by the National Natural Science Foundation of China (62273354, 61673387, 61833016).
Abstract: As a dynamic projection to latent structures (PLS) method with good output prediction ability, dynamic inner PLS (DiPLS) is widely used in the prediction of key performance indicators. However, because DiPLS decomposes the input space obliquely, false alarms arise during fault detection in actual industrial processes. To address this problem, a dynamic modeling method based on autoregressive dynamic inner total PLS (AR-DiTPLS) is proposed. The method first uses the regression relation matrix to decompose the input space orthogonally, which removes information useless for predicting the output from the quality-related dynamic subspace. Then, a vector autoregressive (VAR) model is constructed for the prediction score to separate dynamic information from static information. Based on the VAR model, appropriate statistical indicators are constructed for online monitoring, which reduces the occurrence of false alarms. The effectiveness of the method is verified on the Tennessee-Eastman industrial simulation process and a three-phase flow system.
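The VAR-plus-monitoring step can be illustrated in isolation: fit a VAR(1) model to the latent score sequence by least squares, then monitor the residuals (the static part) with a Hotelling T² statistic. This is a generic sketch of that idea, not the AR-DiTPLS statistics themselves; the VAR order of 1 and the function names are assumptions.

```python
import numpy as np

def fit_var1(scores):
    """Least-squares fit of a VAR(1) model t_k ~ t_{k-1} @ A on a
    score matrix (rows = time steps, cols = latent scores).
    Returns the coefficient matrix A and the residuals."""
    X, Y = scores[:-1], scores[1:]
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ A
    return A, resid

def t2_statistic(resid):
    """Hotelling T^2 on the VAR residuals; in a monitoring scheme
    these values would be compared against a control limit."""
    S_inv = np.linalg.inv(np.cov(resid, rowvar=False))
    centered = resid - resid.mean(axis=0)
    return np.einsum('ij,jk,ik->i', centered, S_inv, centered)
```

Separating the predictable (VAR) part of the scores from the residual part is what lets each statistic track one kind of variation, which is the mechanism the abstract credits for fewer false alarms.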
Abstract: In land-use data generalization, the removal of insignificant parcels of small size is the most frequently used operator. In the traditional generalization method, the small parcel is assigned entirely to one of its neighbors. This study improves the generalization by splitting the insignificant parcel into parts along a weighted skeleton and assigning those parts to different neighbors. The placement of the weighted skeleton depends on the compatibility between the removed object and each neighbor, which considers not only the topological relationship but also the distance relationship and semantic similarity. The process is based on the Delaunay triangulation model, and the paper gives detailed geometric algorithms for the operation.
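The abstract names three compatibility factors (topology, distance, semantics) without giving the formula. A minimal sketch of how such a score could steer the split is below; the linear weighting, the weight values, and the input fields are hypothetical, and the real method positions a geometric skeleton rather than just computing area fractions.

```python
def compatibility(shared_len, distance, sem_sim,
                  w_topo=0.5, w_dist=0.3, w_sem=0.2):
    """Hypothetical compatibility between a removed parcel and one
    neighbor: shared boundary length (topology), inverse distance,
    and semantic similarity, linearly weighted. The weights are
    illustrative assumptions, not values from the paper."""
    return (w_topo * shared_len
            + w_dist / (1.0 + distance)
            + w_sem * sem_sim)

def area_shares(neighbors):
    """neighbors: list of dicts with 'shared_len' (normalized),
    'distance', and 'sem_sim' in [0, 1]. Returns each neighbor's
    fractional share of the removed parcel, i.e. the weights that
    would bias the skeleton toward less compatible neighbors' sides."""
    scores = [compatibility(n['shared_len'], n['distance'], n['sem_sim'])
              for n in neighbors]
    total = sum(scores)
    return [s / total for s in scores]
```

A neighbor with a longer shared boundary, smaller distance, and a more similar land-use class receives a larger share of the removed parcel, matching the intuition in the abstract.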
Funding: Supported by the National Natural Science Foundation of China (No. 42090010, No. 41971349) and the Fundamental Research Funds for the Central Universities, China (No. 2042022dx0001).
Abstract: Clustering by local direction centrality (CDC) is a newly proposed versatile algorithm adept at identifying clusters with heterogeneous density and weak connectivity. Its advantages in accuracy and robustness have been widely validated in computer science, bioscience, and geoscience. However, it has quadratic time complexity due to costly K-nearest neighbor (KNN) search and internal connection operations, which hinders its ability to handle large-scale datasets. To improve its computational efficiency and scalability, we propose a performance-enhanced distributed framework for CDC, named D-CDC, built on workflow-level algorithm optimization and distributed computational acceleration. Specifically, KD-tree spatial indexing is leveraged to reduce the KNN search complexity to logarithmic time, and KNN constraints and disjoint sets are introduced to decrease the computational cost of internal connection. Besides, to minimize cross-partition communication, we designed an Improved QuadTree (ImprovedQT) spatial partitioning method that considers cluster completeness and shape regularity. We then implemented D-CDC on the Apache Spark framework using Resilient Distributed Dataset (RDD) customization techniques. Experiments on six synthetic datasets demonstrate that D-CDC generally preserves the clustering accuracy of the original CDC and achieves up to a 600-fold speedup, reducing the runtime from 142,590 s to 236 s on million-scale datasets. A real-world case study on over 2 million enterprise registration POI records in the Chinese mainland further validates that D-CDC can efficiently identify fine-grained and weakly connected aggregation patterns in large-scale geospatial data.
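The two acceleration ingredients named in the abstract, KD-tree KNN search and disjoint-set merging, can be shown together in a small single-machine sketch. This is not the CDC internal-connection rule itself; the `radius` cutoff and the union-of-neighbors policy are illustrative assumptions, and `scipy.spatial.cKDTree` stands in for the paper's KD-tree index.

```python
import numpy as np
from scipy.spatial import cKDTree

class DisjointSet:
    """Union-find with path halving, used to merge connected points."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i
    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[ri] = rj

def knn_components(points, k, radius):
    """Union each point with its k nearest neighbors that lie within
    `radius`, and return a component label per point. KD-tree queries
    make each KNN lookup logarithmic instead of linear."""
    tree = cKDTree(points)
    dists, idxs = tree.query(points, k=k + 1)  # first hit is the point itself
    ds = DisjointSet(len(points))
    for i, (drow, irow) in enumerate(zip(dists, idxs)):
        for d, j in zip(drow[1:], irow[1:]):
            if d <= radius:
                ds.union(i, int(j))
    return [ds.find(i) for i in range(len(points))]
```

Restricting unions to the KNN lists within a radius is what replaces the quadratic all-pairs connection check; in D-CDC the same merging is additionally carried out across Spark partitions.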