Abstract
Traditional nearest neighbor retrieval over massive high-dimensional data suffers from low recall and high overhead because high-dimensional, large-sample data are poorly classified. To address this, a nearest neighbor retrieval method for massive high-dimensional data based on an improved random forest is proposed. High-dimensional data are collected and reduced in dimension using locally linear embedding. A nearest neighbor retrieval index is then created, and the improved random forest algorithm determines the class of each high-dimensional data item, which enables nearest neighbor retrieval over the massive high-dimensional data. To test the designed method, a comparative experiment against a traditional retrieval method was carried out. The results show that the average recall of the designed method increased by 1.2%, and that both memory overhead and time overhead decreased.
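The abstract reports an improvement in average recall but does not define the metric. A minimal sketch of one common definition follows, assuming recall is the fraction of the true k nearest neighbors (found by brute-force Euclidean search) that a retrieval method returns. The function names, the sample points, and the `approximate` result list are hypothetical and are not taken from the paper.

```python
import math


def exact_knn(query, points, k):
    """Return indices of the k points closest to query (brute force, Euclidean)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return order[:k]


def recall_at_k(retrieved, ground_truth):
    """Fraction of the true nearest neighbors that appear in the retrieved list."""
    if not ground_truth:
        return 0.0
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)


def mean_recall(retrieved_lists, truth_lists):
    """Average recall over a batch of queries (assumed meaning of 'average recall')."""
    scores = [recall_at_k(r, t) for r, t in zip(retrieved_lists, truth_lists)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: five 2-D points and one query.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
query = (0.0, 0.1)

truth = exact_knn(query, points, k=3)   # exact neighbors from brute force
approximate = [0, 2, 3]                 # hypothetical output of an approximate index
print(truth, recall_at_k(approximate, truth))
```

Here the approximate result contains two of the three true neighbors, so its recall is 2/3. In an evaluation such as the one described, this score would be averaged over many queries for each method being compared.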
Author
SUN Hao (孙昊), Urumqi Vocational University, Urumqi 830002, China
Source
Techniques of Automation and Applications (《自动化技术与应用》)
2022, No. 11, pp. 73-76 (4 pages)
Funding
2019 University-Level Key Scientific Research Project of Urumqi Vocational University (2018XZ001).
Keywords
improved random forest algorithm
massive high-dimensional data
data retrieval
nearest neighbor retrieval