Abstract
Non-uniform memory access (NUMA) is the mainstream memory access architecture on modern multicore, multi-socket processor platforms, and NUMA access latency has a significant impact on database query performance. Reducing the latency of cross-NUMA-node accesses during query processing is therefore one of the key issues in modern in-memory database query optimization. Because processors differ considerably in NUMA architecture and NUMA latency, NUMA optimization techniques must be combined with hardware characteristics. This study focuses on the in-memory foreign key join algorithm, which has the highest execution cost and the strongest dependence on data locality in in-memory databases, and explores different NUMA optimization methods, including NUMA-conscious and NUMA-oblivious implementations, on five representative processor platforms: ARM, Intel CLX, Intel ICX, AMD Zen2, and AMD Zen3. Different optimization schemes are applied to data storage, data partitioning, and the caching of join intermediate results, and algorithm performance is compared across the processor architectures. Experimental results show that NUMA-conscious optimization strategies require the integration of software and hardware. Radix Join is neutral in its sensitivity to NUMA latency, with NUMA optimization gains stable at around 30% on all five processor platforms. The NPO algorithm is highly sensitive to NUMA latency, with NUMA optimization gains of 38%-57% across platforms. The Vector Join algorithm is sensitive to NUMA latency, but the impact is relatively small, with NUMA optimization gains of 1%-25%; in terms of algorithm performance characteristics, Vector Join is affected more by cache efficiency than by NUMA latency. NUMA-conscious optimization techniques differ substantially on the ARM platform but differ very little on the x86 platforms, while NUMA-oblivious algorithms have lower complexity and better generality. Judging from processor hardware trends, reducing NUMA access latency can effectively narrow the performance gap between different NUMA-conscious optimization algorithms, simplify join algorithm complexity, and improve join operation performance.
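For readers unfamiliar with NUMA-conscious data placement, the following is a minimal, illustrative C sketch (not the implementation evaluated in the paper) of the basic idea behind such optimizations: each partition of the build-side key array is allocated on a specific NUMA node via libnuma, and the worker thread that processes it is pinned to the same node so that its accesses stay node-local. The data sizes, struct and function names, and the per-partition work (a checksum standing in for hash-table build and probe) are illustrative assumptions; error handling is largely omitted.

/*
 * Illustrative sketch only: NUMA-conscious partitioning with libnuma.
 * Each partition of the build-side keys is placed on one NUMA node and
 * processed by a worker pinned to that node, keeping accesses node-local.
 * Build: gcc numa_partition.c -lnuma -lpthread
 */
#include <numa.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N_KEYS (1 << 20)   /* hypothetical build-side size */

typedef struct {
    int      node;      /* NUMA node this partition lives on         */
    int64_t *keys;      /* keys copied into node-local memory        */
    size_t   n;         /* number of keys in this partition          */
    int64_t  checksum;  /* stand-in for the per-partition build work */
} partition_t;

/* Worker: pin to the partition's node, then touch only local memory. */
static void *build_local(void *arg)
{
    partition_t *p = (partition_t *)arg;
    numa_run_on_node(p->node);          /* keep execution on the owning node */
    int64_t sum = 0;
    for (size_t i = 0; i < p->n; i++)   /* placeholder for hash-table build  */
        sum += p->keys[i];
    p->checksum = sum;
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA API not available\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();

    /* Source data, as it might sit in a NUMA-oblivious layout. */
    int64_t *src = malloc(N_KEYS * sizeof(int64_t));
    for (size_t i = 0; i < N_KEYS; i++)
        src[i] = (int64_t)i;

    partition_t *parts = calloc(nodes, sizeof(partition_t));
    pthread_t   *tids  = calloc(nodes, sizeof(pthread_t));

    /* Range-partition the keys and place each chunk on its own node. */
    size_t chunk = N_KEYS / nodes;
    for (int n = 0; n < nodes; n++) {
        size_t lo = (size_t)n * chunk;
        size_t hi = (n == nodes - 1) ? N_KEYS : lo + chunk;
        parts[n].node = n;
        parts[n].n    = hi - lo;
        parts[n].keys = numa_alloc_onnode(parts[n].n * sizeof(int64_t), n);
        for (size_t i = 0; i < parts[n].n; i++)
            parts[n].keys[i] = src[lo + i];
        pthread_create(&tids[n], NULL, build_local, &parts[n]);
    }

    for (int n = 0; n < nodes; n++) {
        pthread_join(tids[n], NULL);
        printf("node %d: %zu keys, checksum %lld\n",
               n, parts[n].n, (long long)parts[n].checksum);
        numa_free(parts[n].keys, parts[n].n * sizeof(int64_t));
    }
    free(parts); free(tids); free(src);
    return 0;
}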
Authors
HAN Rui-Chen
ZHANG Yan-Song
LIU Zhuan
ZHANG Yu
JIAO Min
WANG Shan
HAN Rui-Chen; ZHANG Yan-Song; LIU Zhuan; ZHANG Yu; JIAO Min; WANG Shan (Engineering Research Center of Database and Business Intelligence, Ministry of Education, Beijing 100872, China; Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University), Ministry of Education, Beijing 100872, China; School of Information, Renmin University of China, Beijing 100872, China; National Survey Research Center, Renmin University of China, Beijing 100872, China; Intel China Research Center Ltd., Beijing 100190, China; National Satellite Meteorological Center, Beijing 100081, China)
Source
Journal of Software (《软件学报》)
Peking University Core Journal (北大核心)
2025, Issue 12, pp. 5821-5850 (30 pages)
Funding
National Key Research and Development Program of China (2023YFB4503600)
National Natural Science Foundation of China (U23A20299, 62172424, 62276270, 62322214)