Journal Articles
3,963 articles found
1. Research on a Cardiac Auxiliary Diagnosis System Based on GPU Visualization Technology
Authors: 陈宇珂, 吴效明, 杨荣骞, 欧陕兴, 郑理华. 《医疗卫生装备》, CAS, 2011, Issue 10, pp. 16-18.
Objective: To achieve accurate segmentation and 3D visualization of cardiac tomographic images based on GPUs and to complete the design of a cardiac auxiliary diagnosis system. Methods: Combining clinical experts' diagnostic experience, prior features of cardiac CT images, and image segmentation algorithm models, GPU parallel data processing techniques were used to implement segmentation and 3D visualization of cardiac structures. Results: Accurate, fast, and robust segmentation and 3D visualization of cardiac CT image sequences were achieved, and a GPU-based visualized cardiac auxiliary diagnosis system was preliminarily implemented. Conclusion: The study makes full use of the powerful parallel computing capability of the graphics processing unit (GPU) to solve problems in medical image processing and segmentation, improving program efficiency and the user experience.
Keywords: expert system; heart; dual-source CT; CUDA; GPUs
2. Efficient Concurrent L1-Minimization Solvers on GPUs (cited: 1)
Authors: Xinyue Chu, Jiaquan Gao, Bo Sheng. Computer Systems Science & Engineering, SCIE/EI, 2021, Issue 9, pp. 305-320.
Given that the concurrent L1-minimization (L1-min) problem is often required in some real applications, we investigate how to solve it in parallel on GPUs in this paper. First, we propose a novel self-adaptive warp implementation of the matrix-vector multiplication (Ax) and a novel self-adaptive thread implementation of the matrix-vector multiplication (ATx), respectively, on the GPU. The vector-operation and inner-product decision trees are adopted to choose the optimal vector-operation and inner-product kernels for vectors of any size. Second, based on the proposed kernels, the iterative shrinkage-thresholding algorithm is utilized to present two concurrent L1-min solvers from the perspective of the streams and the thread blocks on a GPU, and their performance is optimized by using new GPU features such as the shuffle instruction and the read-only data cache. Finally, we design a concurrent L1-min solver on multiple GPUs. The experimental results have validated the high effectiveness and good performance of our proposed methods.
Keywords: concurrent L1-minimization problem; dense matrix-vector multiplication; fast iterative shrinkage-thresholding algorithm; CUDA; GPUs
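These solvers are built around the iterative shrinkage-thresholding algorithm, whose per-iteration core is an element-wise soft-thresholding (shrinkage) step that maps naturally onto one GPU thread per vector component. A minimal CUDA sketch of that step (kernel and variable names are ours, not the paper's):

```cuda
#include <cuda_runtime.h>

// Soft-thresholding (shrinkage): x_i = sign(v_i) * max(|v_i| - lambda, 0).
// In a full ISTA iteration for min ||Ax - b||^2 + lambda*||x||_1, this kernel
// would follow the paper's self-adaptive Ax and A^T x kernels:
//   v = x - (1/L) * A^T (A x - b);  x = soft_threshold(v, lambda / L)
__global__ void soft_threshold(const float* v, float* x, float lambda, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float vi  = v[i];
        float mag = fabsf(vi) - lambda;
        x[i] = (mag > 0.0f) ? copysignf(mag, vi) : 0.0f;
    }
}
```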
3. Accelerating the Discontinuous Galerkin Method for Seismic Wave Propagation Simulations Using Multiple GPUs with CUDA and MPI (cited: 3)
Authors: Dawei Mu, Po Chen, Liqiang Wang. Earthquake Science, 2013, Issue 6, pp. 377-393.
We have successfully ported an arbitrary high-order discontinuous Galerkin method for solving the three-dimensional isotropic elastic wave equation on unstructured tetrahedral meshes to multiple Graphics Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) of NVIDIA and the Message Passing Interface (MPI), and obtained a speedup factor of about 28.3 for the single-precision version of our codes and about 14.9 for the double-precision version. The GPU used in the comparisons is the NVIDIA Tesla C2070 Fermi, and the CPU used is the Intel Xeon W5660. To effectively overlap inter-process communication with computation, we separate the elements on each subdomain into inner and outer elements, and complete the computation on outer elements and fill the MPI buffer first. While the MPI messages travel across the network, the GPU performs computation on inner elements and all other calculations that do not use information of outer elements from neighboring subdomains. A significant portion of the speedup also comes from a customized matrix-matrix multiplication kernel, which is used extensively throughout our program. Preliminary performance analysis of our parallel GPU codes shows favorable strong and weak scalabilities.
Keywords: seismic wave propagation; discontinuous Galerkin method; GPU
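The inner/outer element split described in this abstract follows a standard CUDA + MPI overlap pattern. A schematic sketch of one time step, assuming a single neighboring subdomain and a placeholder element-update kernel (all names are illustrative, not taken from the authors' code):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the DG volume/surface update.
__global__ void update_elements(float* u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] += 1.0f;  // stand-in for the real element update
}

// One time step: outer elements first, then overlap MPI with inner work.
void overlapped_step(float* d_inner, int nInner,
                     float* d_outer, int nOuter,
                     float* h_send, float* h_recv, int nHalo,
                     int neighbor, cudaStream_t stream) {
    const int T = 256;
    // 1. Outer elements: compute, then stage their halo data on the host.
    update_elements<<<(nOuter + T - 1) / T, T, 0, stream>>>(d_outer, nOuter);
    cudaMemcpyAsync(h_send, d_outer, nHalo * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    // 2. Non-blocking halo exchange with the neighboring subdomain.
    MPI_Request reqs[2];
    MPI_Irecv(h_recv, nHalo, MPI_FLOAT, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(h_send, nHalo, MPI_FLOAT, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

    // 3. Inner elements need no remote data: compute while messages fly.
    update_elements<<<(nInner + T - 1) / T, T, 0, stream>>>(d_inner, nInner);

    // 4. Complete communication; the received halo feeds the next step.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```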
4. An Approach to Parallelization of SIFT Algorithm on GPUs for Real-Time Applications (cited: 4)
Authors: Raghu Raj Prasanna Kumar, Suresh Muknahallipatna, John McInroy. Journal of Computer and Communications, 2016, Issue 17, pp. 18-50.
Scale Invariant Feature Transform (SIFT) is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible for a single-threaded implementation to extract local feature descriptors from high-resolution images in real time. In this paper, an approach to parallelization of the SIFT algorithm is demonstrated using NVIDIA's Graphics Processing Unit (GPU). The parallelization design for SIFT on GPUs is divided into two stages: a) algorithm design, generic design strategies that focus on data, and b) implementation design, architecture-specific design strategies that focus on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches, and data blocking achieve a significant decrease in average computational time. Furthermore, it is observed via Paraver tools that our approach to parallelization, while optimizing for maximum occupancy, allows the GPU to execute the memory-bound SIFT algorithm at optimal levels.
Keywords: Scale Invariant Feature Transform (SIFT); parallel computing; GPU; GPU occupancy; portable parallel programming; CUDA
5. Performance Prediction Based on Statistics of Sparse Matrix-Vector Multiplication on GPUs (cited: 1)
Authors: Ruixing Wang, Tongxiang Gu, Ming Li. Journal of Computer and Communications, 2017, Issue 6, pp. 65-83.
As one of the most essential and important operations in linear algebra, the performance prediction of sparse matrix-vector multiplication (SpMV) on GPUs has received more and more attention in recent years. In 2012, Guo and Wang put forward a new idea to predict the performance of SpMV on GPUs. However, they didn't consider the matrix structure completely, so the execution time predicted by their model tends to be inaccurate for general sparse matrices. To address this problem, we propose two new similar models, which take into account the structure of the matrices and make the performance prediction more accurate. In addition, we predict the execution time of SpMV for the CSR-V, CSR-S, ELL, and JAD sparse matrix storage formats with the new models on the CUDA platform. Our experimental results show that the prediction accuracy of our models is 1.69 times better than Guo and Wang's model on average for most general matrices.
Keywords: sparse matrix-vector multiplication; performance prediction; GPU; normal distribution; uniform distribution
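Models of this kind consume simple structural statistics of the sparse matrix. As an illustration only (not the authors' model), the mean and variance of nonzeros per row, which characterize how regular the row lengths are, can be read straight off the CSR row-pointer array:

```cuda
#include <cstdio>
#include <vector>

// Row-length statistics of a CSR matrix: the kind of structural features
// (mean, variance of nonzeros per row) a performance model can consume.
void row_stats(const std::vector<int>& rowPtr, double* mean, double* var) {
    int n = (int)rowPtr.size() - 1;
    double sum = 0.0, sumSq = 0.0;
    for (int r = 0; r < n; ++r) {
        double len = rowPtr[r + 1] - rowPtr[r];
        sum   += len;
        sumSq += len * len;
    }
    *mean = sum / n;
    *var  = sumSq / n - (*mean) * (*mean);
}

int main() {
    // Toy 4-row CSR row-pointer array with row lengths 2, 5, 1, 4.
    std::vector<int> rowPtr = {0, 2, 7, 8, 12};
    double mean, var;
    row_stats(rowPtr, &mean, &var);
    printf("mean nnz/row = %.2f, variance = %.2f\n", mean, var);
    return 0;
}
```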
6. A Fuzzy Neural Network Based Dynamic Data Allocation Model on Heterogeneous Multi-GPUs for Large-Scale Computations
Authors: Chao-Long Zhang, Yuan-Ping Xu, Zhi-Jie Xu, Jia He, Jing Wang, Jian-Hua Adu. International Journal of Automation and Computing, EI/CSCD, 2018, Issue 2, pp. 181-193.
The parallel computation capabilities of modern graphics processing units (GPUs) have attracted increasing attention from researchers and engineers conducting high-computational-throughput studies. However, current single-GPU-based engineering solutions often struggle to fulfill their real-time requirements. Thus, the multi-GPU-based approach has become a popular and cost-effective choice for tackling these demands. In those cases, computational load balancing over multiple GPU "nodes" is often the key bottleneck that affects the quality and performance of the real-time system. Existing load balancing approaches are mainly based on the assumption that all GPU nodes in the same computer framework are of equal computational performance, which is often not the case due to cluster design and other legacy issues. This paper presents a novel dynamic load balancing (DLB) model for rapid data division and allocation on heterogeneous GPU nodes based on an innovative fuzzy neural network (FNN). In this research, a 5-state parameter feedback mechanism defining the overall cluster and node performance is proposed. The corresponding FNN-based DLB model is capable of monitoring and predicting individual node performance under different workload scenarios. A real-time adaptive scheduler has been devised to reorganize the data inputs to each node when necessary to maintain their runtime computational performance. The devised model has been implemented on two-dimensional (2D) discrete wavelet transform (DWT) applications for evaluation. Experimental results show that this DLB model enables high computational throughput while ensuring the real-time and precision requirements of complex computational tasks.
Keywords: heterogeneous GPU cluster; dynamic load balancing; fuzzy neural network; adaptive scheduler; discrete wavelet transform
7. Implementation of a Particle Accelerator Beam Dynamics Code on Multi-Node GPUs
Authors: Zhicong Liu, Ji Qiang. Journal of Software Engineering and Applications, 2019, Issue 9, pp. 321-338.
Particle accelerators play an important role in a wide range of scientific discoveries and industrial applications. Self-consistent multi-particle simulation based on the particle-in-cell (PIC) method has been used to study charged particle beam dynamics inside those accelerators. However, PIC simulation is time-consuming and needs modern parallel computers for high-resolution applications. In this paper, we implemented a parallel beam dynamics PIC code on multi-node hybrid-architecture computers with multiple Graphics Processing Units (GPUs). We used two methods to parallelize the PIC code on multiple GPUs and observed that the replication method is the better choice for moderate problem sizes and current computer hardware, while the domain decomposition method might be better for large problem sizes and more advanced hardware that allows direct communication among multiple GPUs. Using the multi-node hybrid architectures at the Oak Ridge Leadership Computing Facility (OLCF), the optimized GPU PIC code achieves reasonable parallel performance and scales up to 64 GPUs with 16 million particles.
Keywords: particle accelerator; particle-in-cell; GPU; parallel; beam dynamics simulation
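The replication method the authors prefer for moderate problem sizes keeps a full copy of the field grid on every GPU and splits only the particles; after local charge deposition, one collective sum rebuilds the global density. A 1D nearest-grid-point sketch of that data flow (our simplification, assuming unit-weight particles, not the paper's code):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Each rank deposits its own particle subset onto a full, replicated grid.
// Assumes particle positions x are non-negative; unit weight per particle.
__global__ void deposit(const float* x, int nPart, float* rho, int nGrid, float dx) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < nPart) {
        int cell = min((int)(x[p] / dx), nGrid - 1);  // nearest-grid-point
        atomicAdd(&rho[cell], 1.0f);
    }
}

void deposit_and_reduce(const float* d_x, int nLocalPart,
                        float* d_rho, float* h_rho, int nGrid, float dx) {
    cudaMemset(d_rho, 0, nGrid * sizeof(float));
    const int T = 256;
    deposit<<<(nLocalPart + T - 1) / T, T>>>(d_x, nLocalPart, d_rho, nGrid, dx);
    cudaMemcpy(h_rho, d_rho, nGrid * sizeof(float), cudaMemcpyDeviceToHost);
    // Replication method: one collective sum gives every rank the full charge
    // density, so the field solve needs no further communication.
    MPI_Allreduce(MPI_IN_PLACE, h_rho, nGrid, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    cudaMemcpy(d_rho, h_rho, nGrid * sizeof(float), cudaMemcpyHostToDevice);
}
```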
8. Real-Time Scheduling Using GPUs--Advanced and More Accurate Proof of Feasibility
Authors: Peter Fodrek, L'udovit Farkas, Michal Blahol, Martin Foltin, Juraj Hn'it, Tomas Murgas. 《通讯和计算机(中英文版)》, 2012, Issue 8, pp. 863-871.
Keywords: real-time scheduling; GPU; graphics processing unit; DDR memory; proof; evaluation report; scheduling subsystem; Linux
9. PELLR: A Permutated ELLPACK-R Format for SpMV on GPUs
Authors: Zhiqi Wang, Tongxiang Gu. Journal of Computer and Communications, 2020, Issue 4, pp. 44-58.
Sparse matrix-vector multiplication (SpMV) is inevitable in almost all kinds of scientific computation, such as iterative methods for solving linear systems and eigenvalue problems. With the emergence and development of Graphics Processing Units (GPUs), highly efficient formats for SpMV should be constructed. The performance of SpMV is mainly determined by the storage format of the sparse matrix. Based on the idea of the JAD format, this paper improves the ELLPACK-R format, reducing the waiting time between different threads in a warp; a speedup of about 1.5 was achieved in our experimental results. Compared with other formats, such as CSR, ELL, and BiELL, our format achieves optimal SpMV performance on over 70 percent of the test matrices. We propose a method based on parameters to analyze the performance impact of different formats. In addition, a formula is constructed to count the computation and the number of iterations.
Keywords: SpMV; GPU; storage format; high performance
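For context, the baseline ELLPACK-R kernel that PELLR permutes (values stored column-major up to a fixed width, with a per-row length array letting each thread stop early) looks roughly as follows; this is a textbook sketch, not the paper's implementation:

```cuda
#include <cuda_runtime.h>

// ELLPACK-R SpMV: val/colIdx are stored column-major with width maxRowLen;
// rowLen lets each thread stop at its row's true length instead of padding.
__global__ void spmv_ellr(int nRows,
                          const float* val, const int* colIdx,
                          const int* rowLen, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nRows) {
        float sum = 0.0f;
        int len = rowLen[row];
        for (int j = 0; j < len; ++j) {
            // Column-major layout gives coalesced loads across the warp.
            int idx = j * nRows + row;
            sum += val[idx] * x[colIdx[idx]];
        }
        y[row] = sum;
    }
}
```

A warp still waits for its longest row, which is exactly the imbalance the permutation in PELLR targets by grouping rows of similar length.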
10. Acceleration of Points to Convex Region Correspondence Pose Estimation Algorithm on GPUs for Real-Time Applications
Authors: Raghu Raj P. Kumar, Suresh S. Muknahallipatna, John E. McInroy. Journal of Computer and Communications, 2016, Issue 17, pp. 1-17.
In our previous work, a novel algorithm to perform robust pose estimation was presented. The pose was estimated using correspondences between points on the object and regions in the image. The laboratory experiments conducted in the previous work showed that the accuracy of the estimated pose was over 99% for position and 84% for orientation estimation. However, for larger objects the algorithm requires a high number of points to achieve the same accuracy. This requirement makes the algorithm computationally intensive, rendering it infeasible for real-time computer vision applications. In this paper, the algorithm is parallelized to run on NVIDIA GPUs. The results indicate that even for objects having more than 2000 points, the algorithm can estimate the pose in real time for each frame of high-resolution videos.
Keywords: pose estimation; parallel computing; GPU; CUDA; real-time image processing
11. Mixed Precision SpMV on GPUs for Irregular Data with Hierarchical Precision Selection
Authors: Jianfei Xu, Lianhua He, Zhong Jin. CCF Transactions on High Performance Computing, 2025, Issue 2, pp. 129-141.
Sparse matrix-vector multiplication (SpMV) is one of the key kernels extensively employed in both industrial and scientific applications, and its computation and random accesses incur a lot of overhead. To capitalize on higher compute rates and data movement efficiency, there have been efforts to utilize mixed precision SpMV. However, most existing techniques focus on single-grained precision selection for all matrices. In this work, we concentrate on hierarchical precision selection strategies tailored for irregular matrices, driven by the need to achieve optimal load balancing among thread groups executing on GPUs. Based on the concept of strong connection, we first introduce a novel adaptive row-grained precision selection strategy that surpasses the existing strategy within multi-precision Jacobi methods. Second, our experiments have uncovered a range within which converting double-precision floating-point numbers to single precision incurs a loss smaller than the machine precision FLT_EPSILON; this range is used for element-grained precision selection. Subsequently, we propose a hierarchical precision selection compressed sparse row (CSR) storage method and enhance the CSR-Vector kernel, achieving higher relative speedups and better load balancing on a benchmark suite of 41 matrices compared to existing methods. Finally, we integrate the mixed precision SpMV into the generalized minimal residual method (GMRES), achieving faster execution while maintaining convergence accuracy similar to double-precision GMRES.
Keywords: SpMV; mixed precision; GPU; CUDA
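Our reading of the element-grained criterion: demote a double to float only if the relative round-off introduced by the conversion stays below FLT_EPSILON, which in practice rules out values outside float's representable magnitude range. A small illustrative check (an assumption on our part, not the authors' exact rule):

```cuda
#include <cfloat>
#include <cmath>
#include <cstdio>

// Demote a double to float only when the relative conversion loss stays
// below machine epsilon FLT_EPSILON (IEEE-754 semantics assumed).
bool demotable(double v) {
    if (v == 0.0) return true;
    double relLoss = fabs((double)(float)v - v) / fabs(v);
    return relLoss < (double)FLT_EPSILON;
}

int main() {
    printf("%d\n", demotable(1.0));     // 1: exactly representable in float
    printf("%d\n", demotable(0.1));     // 1: loss ~1.5e-8 < FLT_EPSILON
    printf("%d\n", demotable(1e300));   // 0: overflows float to infinity
    printf("%d\n", demotable(1e-300));  // 0: underflows float to zero
    return 0;
}
```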
12. HI-SM3: High-Performance Implementation of SM3 Hash Function on Heterogeneous GPUs
Authors: Jian-Kuo Dong, Wen Wu, Sheng Lu, Le-Tian Sha, Fang-Yu Zheng, Fu Xiao, Hua-Qun Wang. Journal of Computer Science & Technology, 2025, Issue 6, pp. 1546-1562.
Hash functions are essential in cryptographic primitives such as digital signatures, key exchange, and blockchain technology. SM3, built upon the Merkle-Damgård structure, is a crucial element in Chinese commercial cryptographic schemes. Optimizing hash function performance is crucial given the growth of Internet of Things (IoT) devices and the rapid evolution of blockchain technology. In this paper, we introduce HI-SM3, a high-performance implementation framework for accelerating the SM3 cryptographic hash function on heterogeneous GPU (graphics processing unit) parallel computing devices. HI-SM3 enhances the implementation of hash functions across four dimensions: parallelism, register utilization, memory access, and instruction efficiency, resulting in significant performance gains across various GPU platforms. Leveraging the NVIDIA RTX 4090 GPU, HI-SM3 achieves a remarkable peak performance of 454.74 GB/s, surpassing OpenSSL on a high-end 16-core server CPU (E5-2699 v3) by over 150 times. On the Hygon DCU accelerator, a Chinese domestic graphics card, it achieves 113.77 GB/s. Furthermore, compared with the fastest known GPU-based SM3 implementation on the same GPU platform, HI-SM3 exhibits a 3.12x performance improvement. Even on embedded GPUs consuming less than 40 W, HI-SM3 attains a throughput of 5.90 GB/s, twice as high as that of a server-level CPU. In summary, HI-SM3 provides a significant performance advantage, positioning it as a compelling solution for accelerating hash operations.
Keywords: SM3; heterogeneous GPU; CUDA; cryptographic engineering
13. νGNN: Non-Uniformly Partitioned Full-Graph GNN Training on Mixed GPUs
Authors: Hemeng Wang, Wenqing Lin, Qingxiao Sun, Weifeng Liu. CCF Transactions on High Performance Computing, 2025, Issue 4, pp. 305-322.
Graph neural networks (GNNs) can be adapted to GPUs with high computing capability due to their massive arithmetic operations. Compared with mini-batch training, full-graph training does not require sampling of the input graph and halo region, avoiding potential accuracy losses. Current deep learning frameworks evenly partition large graphs to scale GNN training to distributed multi-GPU platforms. On the other hand, the rapid evolution of hardware requires technology companies and research institutions to frequently update their equipment to cope with the latest tasks. This results in large-scale clusters with a mixture of GPUs of various computational capabilities and hardware specifications. However, existing works fail to consider subgraphs adapted to different GPU generations, leading to inefficient resource utilization and degraded training efficiency. Therefore, we propose νGNN, a Non-Uniformly partitioned full-graph GNN training framework for heterogeneous distributed platforms. νGNN first models the GNN processing ability of the hardware based on various theoretical parameters. Then, νGNN automatically derives a reasonable task partitioning scheme by combining hardware, model, and graph dataset information. Finally, νGNN implements an irregular graph partitioning mechanism that allows GNN training tasks to execute efficiently on distributed heterogeneous systems. Experimental results show that, in real-world scenarios with a mixture of GPU generations, νGNN outperforms other static partitioning schemes based on hardware specifications.
Keywords: graph neural network; distributed training; graph partitioning; GPU
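The heart of non-uniform partitioning is sizing each GPU's subgraph by its modeled processing ability instead of an even split. A minimal host-side sketch (the throughput weights and function are hypothetical stand-ins for νGNN's hardware model):

```cuda
#include <cstdio>
#include <vector>

// Assign each GPU a vertex share proportional to its modeled GNN throughput
// instead of an even 1/N split.
std::vector<int> partition_sizes(int nVertices, const std::vector<double>& throughput) {
    double total = 0.0;
    for (double t : throughput) total += t;
    std::vector<int> sizes(throughput.size());
    int assigned = 0;
    for (size_t i = 0; i + 1 < throughput.size(); ++i) {
        sizes[i] = (int)(nVertices * throughput[i] / total);
        assigned += sizes[i];
    }
    sizes.back() = nVertices - assigned;  // remainder goes to the last GPU
    return sizes;
}

int main() {
    // Hypothetical mixed cluster: one older GPU at 1.0x, two newer at 2.5x.
    std::vector<double> tput = {1.0, 2.5, 2.5};
    std::vector<int> s = partition_sizes(600000, tput);
    printf("%d %d %d\n", s[0], s[1], s[2]);  // 100000 250000 250000
    return 0;
}
```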
14. Research on Seismic Wavefield Forward Modeling for Acoustic Remote Detection in Horizontal Boreholes
Authors: 闫海涛, 杨永龙, 刘继国, 叶辉, 乐昭. 《东华理工大学学报(自然科学版)》, PKU Core, 2026, Issue 1, pp. 61-70.
Under extreme conditions such as high-cold, high-altitude environments, horizontal directional drilling is an important means of refined tunnel exploration. However, in hazard-prone zones this method still suffers from the "single-borehole view" limitation and may fail to effectively reveal hazards such as karst caves and underground rivers. To improve survey accuracy and let one borehole serve multiple purposes, a 3D seismic wavefield forward modeling method for acoustic remote detection in horizontal boreholes was studied, introducing multi-CPU and multi-GPU parallel algorithms. Efficient wavefield extrapolation is achieved through partitioned model computation and boundary data exchange between GPUs, and the excitation effects of monopole and dipole acoustic sources are compared. The results show that GPU acceleration improves forward modeling efficiency by a factor of 27 over the CPU, and multi-GPU parallelism further shortens computation time. Wavefield analysis shows that although a monopole source can generate reflected waves, their energy is weak and they are severely aliased with flexural waves at near offsets, whereas the reflected signals generated by a dipole source at inclined interfaces are more significant and clearly separated in time from the flexural waves, making the dipole source more suitable for long-range detection. Heterogeneous parallel computing over multiple GPU cards makes full use of node computing resources and significantly improves computational efficiency. In addition, compared with a monopole source, the reflected waves excited by a dipole source have stronger energy and higher resolution, making them better suited to horizontal-borehole detection in confined spaces.
Keywords: remote detection; dipole acoustic source; GPU; parallelism; 3D forward modeling
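For GPUs within one node, the inter-GPU boundary data exchange mentioned above is typically done with peer-to-peer copies. A schematic CUDA sketch for two GPUs (buffer names are illustrative; a multi-node setup would layer MPI on top):

```cuda
#include <cuda_runtime.h>

// Exchange subdomain boundary slabs between two GPUs in the same node.
// d_haloX is GPU X's outgoing boundary buffer; d_ghostX is the ghost
// region it receives from its neighbor.
void exchange_boundaries(float* d_halo0, float* d_ghost0,
                         float* d_halo1, float* d_ghost1,
                         size_t haloBytes, cudaStream_t s0, cudaStream_t s1) {
    // Enable direct GPU<->GPU access (a no-op error if already enabled;
    // normally done once at startup).
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    // GPU 0's boundary becomes GPU 1's ghost layer, and vice versa.
    cudaMemcpyPeerAsync(d_ghost1, 1, d_halo0, 0, haloBytes, s0);
    cudaMemcpyPeerAsync(d_ghost0, 0, d_halo1, 1, haloBytes, s1);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
}
```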
15. Architecture Design of an Aerospace TT&C and Data Transmission Baseband Based on Heterogeneous Computing
Authors: 孟景涛, 成亚勇, 田之俊, 刘云杰, 邢翠柳. 《航天技术与工程学报》, 2026, Issue 1, pp. 71-81.
With the expansion of China's low-Earth-orbit constellations, ground TT&C and data transmission basebands must cope with larger processing scales, higher generality, and stronger scalability. To build an efficient, flexible, and scalable heterogeneous computing system that meets these requirements, and drawing on resource scheduling and management practices from cloud computing together with an analysis of the current state of heterogeneous computing power, a baseband signal processing architecture was studied around a CPU+GPU+FPGA heterogeneous general-purpose computing platform. The design uses a unified resource management model to dynamically schedule and coordinate multiple types of computing resources, supports cluster management with good cross-platform deployability, and can flexibly configure resources for different scenarios to improve performance and energy efficiency. Experimental verification shows that this heterogeneous-computing-based baseband signal processing architecture satisfies the generality and scalability requirements of various TT&C and data transmission task regimes, has good prospects for engineering application, and provides a feasible technical path for using heterogeneous computing resources in future TT&C systems.
Keywords: TT&C and data transmission baseband; heterogeneous computing; GPU; FPGA; signal processing; resource scheduling
16. Research on a VLBI Correlation Processing Architecture Based on GPU and the Spark Framework
Authors: 谢科屹, 张娟, 童锋贤, 郑为民, 童力, 刘磊. 《天文学进展》, PKU Core, 2026, Issue 1, pp. 126-138.
Very Long Baseline Interferometry (VLBI) is moving toward higher sensitivity and higher spatiotemporal resolution. The number of observing stations and the observing bandwidth have multiplied, so the volume of VLBI observation data has grown sharply, posing severe challenges to existing data processing systems. To meet the demands of large-scale VLBI correlation processing, a VLBI correlation processing architecture based on GPUs and the Spark framework is proposed and implemented. Test results show that the architecture offers high scalability and reliability, with speedup increasing nearly linearly as computing resources are added, and can process large-scale VLBI data efficiently. This lays a technical foundation for the massive data processing required by future VLBI observation tasks and strongly supports the high-speed correlation processing needed for signal synthesis in pulsar timing arrays.
Keywords: VLBI; correlator; GPU; distributed computing; Spark
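The computational core of an FX-style software correlator, which is the kind of kernel such an architecture would distribute across GPUs, multiplies one station's spectrum by the conjugate of another's and integrates per frequency channel. A minimal CUDA sketch (our illustration; the paper's kernels are not shown here):

```cuda
#include <cuComplex.h>
#include <cuda_runtime.h>

// FX correlation core: after per-station FFTs, each frequency channel of a
// baseline accumulates spectrumA * conj(spectrumB) over many FFT segments.
__global__ void cross_multiply_accumulate(const cuFloatComplex* specA,
                                          const cuFloatComplex* specB,
                                          cuFloatComplex* vis,
                                          int nChan, int nSegments) {
    int ch = blockIdx.x * blockDim.x + threadIdx.x;
    if (ch < nChan) {
        cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
        for (int s = 0; s < nSegments; ++s) {
            cuFloatComplex a = specA[s * nChan + ch];
            cuFloatComplex b = specB[s * nChan + ch];
            acc = cuCaddf(acc, cuCmulf(a, cuConjf(b)));
        }
        vis[ch] = acc;  // integrated visibility for this channel
    }
}
```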
17. GPU Memory System Modeling Based on Cache Functional Simulation
Authors: 袁福焱, 郝晓宇, 曹振伟, 张森, 陈俊仕, 安虹. 《小型微型计算机系统》, PKU Core, 2026, Issue 2, pp. 477-486.
Reuse distance analysis is a common trace-based method for cache performance analysis. However, as modern GPU microarchitectures continue to evolve, existing GPU memory analysis models based on reuse distance theory oversimplify too many hardware features and therefore show significant inaccuracy. This paper proposes a GPU memory system modeling framework based on traces and cache functional simulation, which accurately models key memory features of modern GPUs, including sector caches, the adaptive L1 cache allocation mechanism, and write-through and write-back policies. Experiments on the Volta architecture and several benchmark suites show that, compared with the state-of-the-art PPT-GPU-Mem model, the proposed model significantly improves prediction accuracy on several key metrics: the L2 hit rate error drops from 43.39% to 15.86%, and the error in the number of device memory read/write transactions drops from 42% to 16.85%.
Keywords: GPU; memory model; reuse distance; functional simulation; NVIDIA NVBit
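Reuse distance, the number of distinct addresses touched between two accesses to the same address, is the trace statistic these models start from. A simple, unoptimized reference sketch:

```cuda
#include <cstdio>
#include <set>
#include <unordered_map>
#include <vector>

// Reuse distance of an access = number of distinct addresses touched since
// the previous access to the same address (-1, i.e. infinite, on first touch).
std::vector<long> reuse_distances(const std::vector<unsigned long>& trace) {
    std::unordered_map<unsigned long, long> lastUse;  // address -> last time
    std::set<long> liveTimes;                         // last-use times, ordered
    std::vector<long> dist(trace.size());
    for (long t = 0; t < (long)trace.size(); ++t) {
        auto it = lastUse.find(trace[t]);
        if (it == lastUse.end()) {
            dist[t] = -1;  // cold (infinite) reuse distance
        } else {
            // Distinct addresses since last use = live entries newer than it.
            auto pos = liveTimes.upper_bound(it->second);
            dist[t] = (long)std::distance(pos, liveTimes.end());
            liveTimes.erase(it->second);
        }
        liveTimes.insert(t);
        lastUse[trace[t]] = t;
    }
    return dist;
}

int main() {
    std::vector<unsigned long> trace = {0xA, 0xB, 0xC, 0xA, 0xB};
    for (long d : reuse_distances(trace)) printf("%ld ", d);  // -1 -1 -1 2 2
    printf("\n");
    return 0;
}
```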
18. An Accelerated Scheduling Framework for Deep Learning Training Tasks Based on GPU Sharing
Authors: 林辰汐, 李嘉伦, 莫萱, 周杰英, 吴维刚. 《计算机工程与科学》, PKU Core, 2026, Issue 3, pp. 389-397.
Deep learning (DL) is applied ever more widely across business scenarios. How to use GPU cluster resources efficiently to train DL tasks and shorten their completion time has received sustained attention from industry and academia. A single DL training task often cannot fully utilize all of a GPU's computing resources, and the exclusive GPU allocation of traditional schedulers leads to low resource utilization. This paper proposes G-Share, a task scheduling framework based on GPU sharing that allows multiple DL tasks to share one GPU for training, i.e., co-located scheduling. Task scheduling and resource allocation are performed with awareness of inter-task co-location interference, improving GPU utilization and thereby accelerating task execution. Specifically, inter-task interference is characterized through offline modeling and online updates, and the GPU-sharing scheduling problem is modeled as a minimum-weight bipartite matching problem; solving it yields the resource allocation, and a time-slice mechanism enables dynamic scheduling that tracks changes in the optimal co-location combinations in online scenarios. Experiments on SenseTime's DL workload dataset show that G-Share reduces average task completion time by 20.6% compared with baseline methods.
Keywords: cloud computing; deep learning; resource scheduling; GPU sharing; inter-task interference
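G-Share casts co-location as a minimum-weight bipartite matching over a predicted interference matrix. As a simplified stand-in for that step, a greedy assignment over the same kind of matrix looks like this (our illustration, one job placed per GPU per round, not the paper's matching algorithm):

```cuda
#include <cstdio>
#include <vector>

// Greedy interference-aware co-location: pair each waiting job with the
// shared GPU where its predicted slowdown is lowest.
// interf[j][g] = predicted slowdown of job j if packed onto GPU g.
std::vector<int> assign(const std::vector<std::vector<double>>& interf) {
    int nJobs = (int)interf.size();
    int nGpus = nJobs ? (int)interf[0].size() : 0;
    std::vector<int> placement(nJobs, -1);
    std::vector<bool> gpuTaken(nGpus, false);
    for (int j = 0; j < nJobs; ++j) {
        int best = -1;
        for (int g = 0; g < nGpus; ++g)
            if (!gpuTaken[g] && (best < 0 || interf[j][g] < interf[j][best]))
                best = g;
        if (best >= 0) { placement[j] = best; gpuTaken[best] = true; }
    }
    return placement;
}

int main() {
    // Two jobs, two shared GPUs; job 0 interferes less on GPU 1.
    std::vector<std::vector<double>> interf = {{1.4, 1.1}, {1.2, 1.3}};
    std::vector<int> p = assign(interf);
    printf("job0 -> GPU %d, job1 -> GPU %d\n", p[0], p[1]);  // GPU 1, GPU 0
    return 0;
}
```

Unlike a true minimum-weight matching, this greedy pass can miss the globally best pairing, which is why the paper solves the matching problem instead.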
19. A GPU-Accelerated Clustering Algorithm for High-Dimensional Vectors
Authors: 李忠根, 龚盛豪, 于浩然, 朱轶凡, 柳晴, 高云君. 《软件学报》, PKU Core, 2026, Issue 3, pp. 1037-1057.
Clustering is a key technique for analyzing large-scale high-dimensional vector data. In recent years, the density-based clustering algorithm DBSCAN (density-based spatial clustering of applications with noise) has been widely used in data analysis because it needs no preset number of clusters, can discover complex cluster structures, and effectively identifies noise points. However, existing density-based clustering algorithms incur extremely high time costs on high-dimensional vector data and face problems such as the curse of dimensionality, making them hard to deploy in practice. Moreover, as information technology develops, the scale of high-dimensional vector data is growing rapidly, and CPU-based high-dimensional vector clustering faces even greater challenges in time cost and scalability. This paper therefore proposes a GPU-accelerated clustering algorithm for high-dimensional vectors that accelerates DBSCAN by introducing a K-nearest-neighbor (KNN) graph index. First, a GPU-accelerated parallel KNN graph construction algorithm is designed, significantly reducing the cost of building the KNN graph index. Second, a K-means tree partitioning algorithm based on inter-level parallelism and a parallel clustering algorithm based on breadth-first search and the core-neighbor graph are proposed, improving the computation flow of DBSCAN and achieving highly concurrent vector clustering. Finally, extensive experiments on real vector datasets compare the proposed method with existing methods. The results show that, while preserving clustering accuracy, the proposed method improves the efficiency of large-scale vector clustering by 5.7x to 2822.5x.
Keywords: density-based clustering; high-dimensional vectors; GPU acceleration; parallel computing; K-nearest-neighbor graph
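The clustering stage described above can be pictured as breadth-first expansion from core points over a precomputed core-neighbor graph. A compact host-side sketch of that logic (our sequential simplification; the paper's version runs in parallel on the GPU):

```cuda
#include <cstdio>
#include <queue>
#include <vector>

// DBSCAN-style expansion over a precomputed neighbor graph: BFS from each
// unvisited core point; clusters grow only through core points.
std::vector<int> expand_clusters(const std::vector<std::vector<int>>& nbrs,
                                 const std::vector<bool>& isCore) {
    int n = (int)nbrs.size();
    std::vector<int> label(n, -1);  // -1 = noise / unassigned
    int cluster = 0;
    for (int s = 0; s < n; ++s) {
        if (!isCore[s] || label[s] != -1) continue;
        std::queue<int> q;
        q.push(s);
        label[s] = cluster;
        while (!q.empty()) {
            int u = q.front(); q.pop();
            if (!isCore[u]) continue;  // border points do not expand further
            for (int v : nbrs[u])
                if (label[v] == -1) { label[v] = cluster; q.push(v); }
        }
        ++cluster;
    }
    return label;
}

int main() {
    // 5 points: 0-1-2 are a chain of cores, 3 is a border of 2, 4 is noise.
    std::vector<std::vector<int>> nbrs = {{1}, {0, 2}, {1, 3}, {2}, {}};
    std::vector<bool> isCore = {true, true, true, false, false};
    for (int l : expand_clusters(nbrs, isCore)) printf("%d ", l);  // 0 0 0 0 -1
    printf("\n");
    return 0;
}
```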
20. GPU Performance Analysis and Modeling Methods for Distributed Clusters: Status and Prospects
Authors: 赵海燕, 李志凯, 钱诗友, 曹健. 《小型微型计算机系统》, PKU Core, 2026, Issue 1, pp. 58-72.
With the rapid development of artificial intelligence and high-performance computing, model complexity and data scale keep growing, making it difficult for a single GPU to handle large-scale computing tasks. Distributed GPU clusters have therefore become essential infrastructure for modern deep learning and scientific computing. To fully exploit the computing potential of such systems, efficient performance analysis and modeling methods are critical for identifying system bottlenecks, optimizing resource utilization, and guiding system design decisions. This paper systematically surveys state-of-the-art methods for GPU performance analysis and modeling in distributed cluster environments. It first analyzes current mainstream GPU architectures and their internal mechanisms, explaining the sources of their efficiency on parallel computing tasks. It then introduces commonly used performance metrics and analysis tools, offering practical guidance for architects and operations engineers in choosing analysis frameworks for specific application needs. The paper further discusses advanced modeling methods, including bottleneck identification, fault attribution, and fine-grained performance characterization. Finally, it discusses the remaining challenges in this field and looks ahead to building more accurate, scalable, and interpretable GPU performance analysis methods.
Keywords: GPU performance analysis methods; distributed clusters; deep learning training and inference; performance modeling