Estimating the Number of Distinct Values (NDVs) is a critical task in the fields of databases and data streams. Over time, various algorithms for estimating NDVs have been developed, each tailored to different requirements for time, I/O, and accuracy. These algorithms can be broadly categorized into two main types: sampling-based and sketch-based. Sampling-based NDV algorithms improve efficiency by sampling rather than accessing all items, often at the cost of reduced accuracy. In contrast, sketch-based NDV algorithms scan the entire dataset and maintain a compact hash-based sketch, typically offering higher accuracy but at the expense of increased I/O costs. When dealing with large-scale data, scanning the entire table may become infeasible. Thus, the challenge of efficiently and accurately estimating NDVs has persisted for decades. This paper provides a comprehensive review of the fundamental concepts and key techniques of NDV estimation, together with a comparative analysis of various NDV estimation algorithms. We first briefly examine traditional estimators in chronological order, followed by an in-depth discussion of the newer estimators developed over the past decade, highlighting the specific scenarios in which they are applicable. Furthermore, we illustrate how NDV estimation algorithms have been adapted to address the complexities of modern real-world data environments. Despite significant progress in NDV estimation research, challenges remain in terms of theoretical scalability and practical application. This paper also explores potential future directions, including block sampling NDV estimation, learning-based NDV estimation, and their implications for database applications.
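The trade-off between the two families can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not code from the surveyed paper: the abstract names no specific estimator, so the sketch-based side uses a K-Minimum-Values (KMV) sketch and the sampling-based side uses the GEE estimator (sqrt(N/n) * f1 + sum of f_j for j >= 2), both chosen here only as representative examples. The function names `kmv_estimate` and `gee_estimate` and the parameters `k`, `sample_size`, and `seed` are hypothetical.

```python
import hashlib
import heapq
import random
from collections import Counter


def _hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) with a stable hash."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k=256):
    """Sketch-based (illustrative choice: KMV): full scan, O(k) memory."""
    heap = []        # negated hashes, so heap[0] is the largest kept hash
    members = set()  # hashes currently kept, used to skip duplicates
    for item in items:
        h = _hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is among the k smallest: evict the current largest.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: exact count
    return (k - 1) / (-heap[0])  # standard KMV estimate from the k-th minimum


def gee_estimate(items, sample_size, seed=0):
    """Sampling-based (illustrative choice: GEE): reads only a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(items, sample_size)
    value_counts = Counter(sample)
    freq_of_freq = Counter(value_counts.values())  # f_j: values seen j times
    f1 = freq_of_freq.get(1, 0)
    seen_more_than_once = sum(c for j, c in freq_of_freq.items() if j >= 2)
    scale = (len(items) / sample_size) ** 0.5
    return scale * f1 + seen_more_than_once


if __name__ == "__main__":
    table = [i % 10_000 for i in range(100_000)]  # 10,000 distinct values
    print("KMV (full scan):", round(kmv_estimate(table, k=256)))
    print("GEE (1% sample):", round(gee_estimate(table, sample_size=1_000)))
```

The KMV function touches every item but keeps only `k` hash values, while the GEE function reads only `sample_size` items and extrapolates from how often each sampled value repeats, which mirrors the I/O-versus-accuracy tension described above.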
Microseismic event location is one of the core parameters in microseismic monitoring, and localization accuracy directly affects the effectiveness of engineering applications. However, because of spatial constraints, the geometry of the sensor installation is often close to linear, which makes the localization equations ill-conditioned and greatly reduces localization accuracy. To address this problem, the causes of the ill-conditioning are analyzed from the perspective of the objective function residuals and the coefficient matrix. The ill-conditioning is caused by the combined effect of a poor sensor array and data errors. Its residual isosurface shows a conical distribution; as the residual value decreases, the apex of the isosurface gradually extends to the far side, and the localization results do not converge. For this reason, an improved regularized Newton downhill localization algorithm is proposed. First, the Newton downhill method is improved so that the seismic source parameters have the same order of magnitude, which reduces the condition number of the coefficient matrix. Then, the L-curve method is used to calculate the regularization factor for the ill-conditioned equations, which improves the coefficient matrix. Finally, the ill-conditioned equations are regularized, and the seismic source coordinates are obtained by the improved Newton downhill method. The results of engineering applications show that, compared with the traditional algorithm based on automatic P-arrival picking, the number of effective microseismic events calculated by the proposed localization algorithm increases by 194.7%, and the localization accuracy is substantially improved. The proposed algorithm mitigates the problem of low accuracy in S-arrival picking and allows localization using only P-wave arrivals. The method lowers the data quality requirements and significantly improves both the utilization of microseismic events and the positioning accuracy.
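The three steps above (rescaling the unknowns, regularizing the ill-conditioned equations, and iterating with a downhill step factor) can be sketched in Python as follows. This is a simplified stand-in, not the authors' algorithm. It assumes a homogeneous P-wave velocity `v`, and it rescales the origin time to a length `tau = v * t0` so that all four unknowns share one unit. It uses a fixed Tikhonov factor `lam` in place of the L-curve selection, and it halves the step until the residual decreases. All function and parameter names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def residuals(p, sensors, arrivals, v):
    """r_i = v*t_i - (distance_i + tau), with p = (x, y, z, tau), tau = v*t0."""
    return [v * t - (math.dist(p[:3], s) + p[3])
            for s, t in zip(sensors, arrivals)]


def locate(sensors, arrivals, v, start, lam=1e-3, iters=50):
    """Regularized Newton-type iteration with a downhill (step-halving) factor.

    `start` is (x, y, z, tau); `lam` is an assumed fixed regularization factor.
    Returns (x, y, z, t0).
    """
    p = list(start)
    r = residuals(p, sensors, arrivals, v)
    cost = sum(e * e for e in r)
    for _ in range(iters):
        # Jacobian of the prediction (distance_i + tau) w.r.t. (x, y, z, tau).
        jac = []
        for s in sensors:
            d = math.dist(p[:3], s) or 1e-12
            jac.append([(p[k] - s[k]) / d for k in range(3)] + [1.0])
        # Tikhonov-regularized normal equations: (J^T J + lam*I) step = J^T r.
        a = [[sum(row[i] * row[j] for row in jac) + (lam if i == j else 0.0)
              for j in range(4)] for i in range(4)]
        g = [sum(row[i] * e for row, e in zip(jac, r)) for i in range(4)]
        step = solve_linear(a, g)
        # Downhill factor: halve the step until the residual norm decreases.
        w = 1.0
        while w > 1e-6:
            trial = [pi + w * si for pi, si in zip(p, step)]
            r_trial = residuals(trial, sensors, arrivals, v)
            c_trial = sum(e * e for e in r_trial)
            if c_trial < cost:
                p, r, cost = trial, r_trial, c_trial
                break
            w *= 0.5
        else:
            break  # no improving step found: stop iterating
        if cost < 1e-18:
            break
    return p[0], p[1], p[2], p[3] / v
```

Only P-wave arrival times enter the residuals, which matches the P-only localization described above. On a nearly linear sensor array the matrix `J^T J` becomes close to singular, and the `lam` term is what keeps the linear solve stable in that case.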
Funding (NDV estimation review): supported in part by the National Science and Technology Major Project (2022ZD0114802), the National Natural Science Foundation of China (Grant Nos. U2241212, 61932001), the Beijing Natural Science Foundation (No. 4222028), the Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098), and the Huawei-Renmin University joint program on Information Retrieval.
Funding (microseismic localization): financial support from the National Natural Science Foundation of China (Grant No. 42077263).