Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture (cited by: 13)
1
Authors: 郑方, 李宏亮, 吕晖, 过锋, 许晓红, 谢向辉. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2015, Issue 1, pp. 145-162, 18 pages.
Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
Keywords: heterogeneous many-core processor; data stream transfer; register-level communication mechanism; hardware synchronization technique; processor prototype
Fault Tolerance Mechanism in Chip Many-Core Processors (cited by: 1)
2
Authors: 张磊, 韩银和, 李华伟, 李晓维. Tsinghua Science and Technology (SCIE, EI, CAS), 2007, Issue S1, pp. 169-174, 6 pages.
As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve N-core processors' yield by providing M spare cores. In such an architecture, topology is an important factor because it greatly affects the processors' performance. The concept of logical topology and a topology reconfiguration problem are introduced; reconfiguration can transparently provide the target topology with the lowest performance degradation in the presence of faulty on-chip cores. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that RRCS can give solutions with an average degradation of 13.8% in negligible computing time.
Keywords: chip many-core processors; yield; fault tolerance; reconfiguration; network-on-chip
Parallelization and sustainability of distributed genetic algorithms on many-core processors
3
Authors: Yuji Sato, Mikiko Sato. International Journal of Intelligent Computing and Cybernetics (EI), 2014, Issue 1, pp. 2-23, 22 pages.
Purpose: The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs). Design/methodology/approach: For distributed genetic algorithm (GA) models, the paper proposes a method where an island's ID number is added to the header of data transferred by this island for use in fault detection. Findings: The paper has shown that the processing time of the proposed idea is practically negligible in applications and also shown that an optimal solution can be obtained even with a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults. Originality/value: The study described in this paper is a new approach to increase the sustainability of application programs using distributed GA on GPUs and MCPs.
Keywords: evolutionary computation; genetic algorithms; fault identification; many-core processors; parallelization
Enabling Highly Efficient k-Means Computations on the SW26010 Many-Core Processor of Sunway TaihuLight (cited by: 1)
4
Authors: Min Li, Chao Yang, Qiao Sun, Wen-Jing Ma, Wen-Long Cao, Yu-Long Ao. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2019, Issue 1, pp. 77-93, 17 pages.
With the advent of the big data era, the amounts of sampling data and the dimensions of data features are rapidly growing. It is highly desired to enable fast and efficient clustering of unlabeled samples based on feature similarities. As a fundamental primitive for data clustering, the k-means operation is receiving increasing attention today. To achieve high performance k-means computations on modern multi-core/many-core systems, we propose a matrix-based fused framework that can achieve high performance by conducting computations on a distance matrix and at the same time can improve the memory reuse through the fusion of the distance-matrix computation and the nearest centroids reduction. We implement and optimize the parallel k-means algorithm on the SW26010 many-core processor, which is the major horsepower of Sunway TaihuLight. In particular, we design a task mapping strategy for load-balanced task distribution, a data sharing scheme to reduce the memory footprint and a register blocking strategy to increase the data locality. Optimization techniques such as instruction reordering and double buffering are further applied to improve the sustained performance. Discussions on block-size tuning and performance modeling are also presented. We show by experiments on both randomly generated and real-world datasets that our parallel implementation of k-means on SW26010 can sustain a double-precision performance of over 348.1 Gflops, which is 46.9% of the peak performance and 84% of the theoretical performance upper bound on a single core group, and can achieve a nearly ideal scalability to the whole SW26010 processor of four core groups. Performance comparisons with the previous state-of-the-art on both CPU and GPU are also provided to show the superiority of our optimized k-means kernel.
Keywords: parallel k-means; performance optimization; SW26010 processor; Sunway TaihuLight
Towards optimized tensor code generation for deep learning on sunway many-core processor
5
Authors: Mingzhen LI, Changxi LIU, Jianjin LIAO, Xuegui ZHENG, Hailong YANG, Rujun SUN, Jun XU, Lin GAN, Guangwen YANG, Zhongzhi LUAN, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2024, Issue 2, pp. 1-15, 15 pages.
The flourish of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor renders itself as a competitive candidate for its attractive computational power in both scientific computing and deep learning workloads. This paper combines the trends in these two directions. Specifically, we propose swTVM that extends the original TVM to support ahead-of-time compilation for architectures requiring cross-compilation such as Sunway. In addition, we leverage the architecture features during the compilation, such as core groups for massive parallelism, DMA for high bandwidth memory transfer and local device memory for data locality, in order to generate efficient code for deep learning workloads on Sunway. The experiment results show that the code generated by swTVM achieves a 1.79x improvement in inference latency on average compared to the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and the Sunway processor, particularly with productivity and efficiency in mind. We believe this work will encourage more people to embrace the power of deep learning and the Sunway many-core processor.
Keywords: Sunway processor; deep learning compiler; code generation; performance optimization
A Scalable Interconnection Scheme in Many-Core Systems
6
Authors: Allam Abumwais, Mujahed Eleyat. Computers, Materials & Continua (SCIE, EI), 2023, Issue 10, pp. 615-632, 18 pages.
Recent architectures of multi-core systems may have a relatively large number of cores that typically ranges from tens to hundreds; they are therefore called many-core systems. Such systems require an efficient interconnection network that tries to address two major problems. First, the overhead of power and area cost and its effect on scalability. Second, the high access latency caused by multiple cores' simultaneous accesses to the same shared module. This paper presents an interconnection scheme called N-conjugate Shuffle Clusters (NCSC) based on multi-core multicluster architecture to reduce the overhead of these problems. NCSC eliminates the need for router devices and their complexity and hence reduces the power and area costs. It also redesigns and distributes the shared caches across the interconnection network to increase the ability for simultaneous access and hence reduce the access latency. For intra-cluster communication, Multi-port Content Addressable Memory (MPCAM) is used. The experimental results using four clusters with four cores each indicate that the average access latency for a write process is 1.14785±0.04532 ns, which is nearly equal to the latency of a write operation in MPCAM. Moreover, the average read latency is 1.26226±0.090591 ns within a cluster and around 1.92738±0.139588 ns for read access between cores from different clusters.
Keywords: many-core; multi-core; N-conjugate shuffle; multi-port content addressable memory; interconnection network
Typhoon Case Comparison Analysis Between Heterogeneous Many-Core and Homogenous Multicore Supercomputing Platforms
7
Authors: LIU Xin, YU Xiaolin, ZHAO Haoran, HAN Qiqi, ZHANG Jie, WANG Chengzhi, MA Weiwei, XU Da. Journal of Ocean University of China (SCIE, CAS, CSCD), 2023, Issue 2, pp. 324-334, 11 pages.
In this paper, a typical experiment is carried out based on a high-resolution air-sea coupled model, namely, the coupled ocean-atmosphere-wave-sediment transport (COAWST) model, on both heterogeneous many-core (SW) and homogenous multicore (Intel) supercomputing platforms. We construct a hindcast of Typhoon Lekima on both the SW and Intel platforms, compare the simulation results between these two platforms and compare the key elements of the atmospheric and ocean modules to reanalysis data. The comparative experiment in this typhoon case indicates that the domestic many-core computing platform and general cluster yield almost no differences in the simulated typhoon path and intensity, and the differences in surface pressure (PSFC) in the WRF model and sea surface temperature (SST) in the short-range forecast are very small, whereas a major difference can be identified at high latitudes after the first 10 days. Further heat budget analysis verifies that the differences in SST after 10 days are mainly caused by shortwave radiation variations, as influenced by subsequently generated typhoons in the system. These typhoons generated in the hindcast after the first 10 days attain obviously different trajectories between the two platforms.
Keywords: heterogeneous many-core supercomputing platform; homogenous multicore supercomputing platform; comparison analysis; typhoon case
Deep Packet Inspection Based on Many-Core Platform
8
Authors: Ya-Ru Zhan, Zhao-Shun Wang. Journal of Computer and Communications, 2015, Issue 5, pp. 1-6, 6 pages.
With the development of computer technology, network bandwidth and network traffic continue to increase. Given the large data flow, it is imperative to inspect network packets effectively. To find a deep packet inspection solution suited to the current network environment, this paper builds a deep packet inspection system on a many-core platform and thereby verifies the feasibility of implementing such a system with both high performance and low consumption. Testing and analysis of the system performance show that deep packet inspection based on the many-core platform TILE_Gx36 [1] [2] can process network traffic at bandwidths of up to 4 Gbps. To a certain extent, this is an improvement over most current deep packet inspection systems based on the x86 platform.
Keywords: many-core platform; deep packet inspection; application layer protocol; TILE_Gx36
Multiple Levels of Abstraction in the Simulation of Microthreaded Many-Core Architectures
9
Authors: Irfan Uddin. Open Journal of Modelling and Simulation, 2015, Issue 4, pp. 159-190, 32 pages.
Simulators are generally used during the design of computer architectures. Typically, different simulators with different levels of complexity, speed and accuracy are used. However, for early design space exploration, simulators with less complexity, high simulation speed and reasonable accuracy are desired. It is also required that these simulators have a short development time and that changes in the design require less implementation effort, so that experiments can be performed and the effects of design changes observed. These simulators are termed high-level simulators in the context of computer architecture. In this paper, we present multiple levels of abstraction in a high-level simulation of a general-purpose many-core system, where the objective of every level is to improve the accuracy of simulation without significantly affecting the complexity and simulation speed.
Keywords: high-level simulations; multiple levels of abstraction; design space exploration; many-core systems
Zuchongzhi-3 Sets New Benchmark with 105-Qubit Superconducting Quantum Processor
10
Authors: LIU Danxu, GE Shuyun, WU Yuyang. Bulletin of the Chinese Academy of Sciences, 2025, Issue 1, pp. 55-56, 2 pages.
A team of researchers from the University of Science and Technology of China (USTC) of the Chinese Academy of Sciences (CAS) and its partners have made significant advancements in random quantum circuit sampling with Zuchongzhi-3, a superconducting quantum computing prototype featuring 105 qubits and 182 couplers.
Keywords: quantum circuit sampling; superconducting quantum computing prototype; Zuchongzhi; superconducting quantum processor; qubits; couplers
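Research on Post-Processing of the PLANE Command for the Heidenhain iTNC530 System Based on PowerMILL PostProcessor
11
Authors: 康晓崇. 《机械研究与应用》, 2025, Issue 5, pp. 102-107, 6 pages.
Post-processing is the key bridge between computer-aided manufacturing (CAM) and CNC machining, and its performance directly affects machining accuracy and efficiency. Using the PowerMILL post-processor editor, this paper develops a post-processor for the Heidenhain iTNC530 system that generates PLANE commands automatically to suit complex multi-axis machining tasks. The paper describes the development workflow in detail, including extraction of the tool orientation vector, calculation of the rotation angles, and generation of the PLANE commands, and uses concrete cases to show how mathematical models and rotation matrices are applied to optimize toolpath control. Simulation-based verification shows that the post-processor generates high-precision NC programs and improves the automation and stability of machining, offering practical guidance and a technical reference for post-processor development in multi-axis machining.
Keywords: post-processor development; Heidenhain iTNC530; PLANE command; mathematical model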
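Energy-Efficient Scheduling Algorithm for Heterogeneous Multi-Core Real-Time Systems Based on Task Synchronization
12
Authors: 赵小松, 黄超, 李鉴, 康玉龙. 《计算机科学》 (PKU Core), 2026, Issue 1, pp. 241-251, 11 pages.
Current research on energy-efficient scheduling of synchronized tasks in multi-core real-time systems mainly targets homogeneous multi-core processor platforms, although heterogeneous multi-core architectures can exploit system performance more effectively. Applying existing work directly to heterogeneous multi-core systems leads to higher energy consumption when schedulability must be guaranteed. To address this, the paper uses Dynamic Voltage Frequency Scaling (DVFS) to study energy-efficient scheduling based on task synchronization in heterogeneous multi-core real-time systems, and proposes the Synchronization Aware-Largest Energy Saved First (SA-LESF) algorithm. The algorithm iteratively optimizes the speed configuration of all tasks until every task reaches the speed configuration that maximizes its energy savings. The paper further proposes the Synchronization Aware-Largest Energy Saved First with Dynamic Reclamation (SA-LESF-DR) algorithm, which is based on dynamic slack reclamation. While keeping real-time tasks schedulable, this algorithm applies a corresponding reclamation strategy to reduce system energy consumption further. Experimental results show that SA-LESF and SA-LESF-DR perform well on energy consumption, saving up to 30% compared with other algorithms on the same task sets.
Keywords: real-time systems; heterogeneous multi-core processors; task synchronization; energy-efficient scheduling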
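Teaching Research on Domestic ARM Architecture in Computer Organization
13
Authors: 王龙翔, 董小社, 张兴军, 陈衡, 王今雨, 张利平, 安健. 《实验室科学》, 2026, Issue 1, pp. 178-182, 188, 6 pages.
As the information technology and innovation field flourishes, the development of domestic processors is drawing growing attention. Against this background, the paper explores the reform of, and research on, computer organization laboratory teaching based on domestic processors. By analyzing current trends and applications of domestic processor technology and drawing on teaching experience, it proposes a laboratory teaching plan for computer organization based on domestic processors. Taking full account of the characteristics and performance of domestic processors, the plan includes a series of experiments that meet the teaching objectives and content requirements, with the aim of developing students' theoretical knowledge of computer organization and their practical skills. In addition, because the five-stage pipeline is difficult to teach in traditional courses and hard for students to understand, GEM5-based experiments are introduced so that students can grasp the principles of the five-stage pipeline more intuitively. The approach achieved good results in teaching practice and was highly rated by students. Future work will refine the teaching plan and promote the application and adoption of domestic processors in computer education, contributing to the training of more highly qualified information technology professionals.
Keywords: information technology application innovation industry; domestic processors; domestic information systems; computer organization; experimental teaching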
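Liability for Damages of Passenger Data Processors from the Perspective of the Low-Altitude Economy
14
Authors: 郝秀辉, 李佳睿. 《西北工业大学学报(社会科学版)》, 2026, Issue 1, pp. 112-119, 8 pages.
Passenger data is a key element of low-altitude passenger transport and low-altitude tourism consumption, yet data processors infringe passengers' data rights and interests from time to time. Clarifying the liability for damages of passenger data processors and the path to dispute resolution is important for promoting high-quality development of the low-altitude economy. At present, determining liability faces practical difficulties, including gaps in the fit of the applicable norms, the multiple attributes of data, and disparities in strength between the parties. In response, the aggregation risk of passenger data provides a practical justification for imposing obligations on processors, and these obligations are implemented through categorized measures. In determining liability for damages, the core is to define the conduct by the resulting harm, assess causation objectively, determine fault according to how the obligations were performed, and clarify the boundaries of exemption. To help resolve disputes over passenger data rights and interests, a data-use mechanism led by data processors and centered on risk sharing should be established.
Keywords: low-altitude economy; passenger data; data processors; liability for compensation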
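Detection and Control System for an Active Filter Based on a Dual-DSP (Digital Signal Processor) Structure (cited by: 3)
15
Authors: 孙建军, 王晓峰, 汤洪海, 查晓明, 陈允平. 《武汉大学学报(工学版)》 (CAS, CSCD, PKU Core), 2001, Issue 3, pp. 55-59, 5 pages.
The paper briefly introduces the development and performance characteristics of the Digital Signal Processor (DSP), and discusses in detail the implementation, basic structure, and algorithms of an active filter detection and control system built from two DSPs.
Keywords: active filter; flexible power system; digital signal; single-chip microcomputer; control system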
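Design and Performance Evaluation of a Router Node for an NOC Structure in Multi-Processor Measurement Systems (cited by: 1)
16
Authors: 武畅, 李玉柏, 彭启琮. 《电子测量与仪器学报》 (CSCD), 2008, Issue 5, pp. 101-106, 6 pages.
This paper proposes a microarchitecture for the router node of an NOC structure used in multi-processor measurement systems, and describes in detail the structure and function of each part of the router node. To demonstrate the feasibility and practicality of the proposed structure, the paper designs a DSP- and FPGA-based hardware platform for simulating NOC structures and evaluates the resource consumption of the router node. Finally, it builds an NOC based on a 4×4 mesh topology from 16 router nodes. Simulation yields the network's latency, throughput, and area consumption at different injection rates under different communication patterns, and these are compared with the results for router nodes that use output buffering. Comparison and analysis results are also given for two microarchitectural parameters that strongly affect network performance: VOQ (virtual output queue) and output buffer size.
Keywords: NOC; router node; microarchitecture; multi-processor; simulation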
A SMART COMPENSATION SYSTEM BASED ON MCA7707 PROCESSOR
17
Authors: 赵敏, 姚敏, 颜彦. Transactions of Nanjing University of Aeronautics and Astronautics (EI), 2001, Issue 1, pp. 97-101, 5 pages.
This paper presents a smart compensation system based on the MCA7707 (a kind of signal processor). The linear errors and high-order errors of a sensor (especially a piezoresistive sensor) can be corrected by using this system. It can optimize the process of piezoresistive sensor calibration and compensation, yielding a total error factor within 0.2% of the sensor's repeatability errors. Data are recorded and coefficients are determined automatically by this system, so sensor compensation is greatly simplified. For ease of operation, a wizard compensation program is designed to correct every error and to obtain the optimum compensation.
Keywords: MCA7707 processor; temperature compensation; piezoresistive sensor
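Study of Metal Contamination in the Fabrication Line for Superconducting Quantum Processor Chips
18
Authors: 徐晓, 张海斌, 宿非凡, 严凯, 荣皓, 邓辉, 杨新迎, 马效腾, 董学, 王绮名, 刘佳林, 李满满. 《物理学报》 (PKU Core), 2026, Issue 1, pp. 316-322, 7 pages.
The fabrication process for superconducting quantum processor chips faces distinctive metal contamination challenges, because its material system and process characteristics differ markedly from those of conventional semiconductor chips. This study systematically analyzes the sources, diffusion mechanisms, and control strategies for metal contamination in quantum chips, focusing on the bulk and surface diffusion behavior of superconducting materials (such as Ta, Nb, Al, and TiN) on sapphire and silicon substrates. Sapphire substrates are found to resist diffusion very well owing to their dense lattice structure, whereas silicon substrates call for particular attention to the contamination risk from readily migrating metals such as Au, In, and Sn. Experiments confirm that in under-bump metallization layers with a Ti/Au structure on silicon substrates, Au readily diffuses through, and increasing the Ti layer thickness does not significantly improve the barrier effect. The low-temperature processing (<250 °C) and ultra-low-temperature operating environment (mK range) of quantum chips effectively suppress metal diffusion, but exposed metal surfaces and the diversity of materials still pose unique challenges. The study recommends establishing a metal contamination control system specific to quantum chips, and proposes directions for follow-up work on evaluating new materials, controlling surface states, and studying long-term reliability. The paper provides theoretical support and technical guidance for process optimization and performance improvement of superconducting quantum chips.
Keywords: superconducting quantum processor chip; fabrication-line metal contamination; bulk diffusion; surface diffusion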
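A Global Greedy Allocation Method for Simulation Points in Pre-Silicon Processor Performance Evaluation
19
Authors: 韩晨吉, 薛峰, 吴钰轩, 汪文祥, 张福新. 《高技术通讯》 (PKU Core), 2026, Issue 1, pp. 29-40, 12 pages.
Simulation point (SimPoint), a representative sampling technique, is widely used in pre-silicon processor performance evaluation. SimPoint determines the number of simulation points for each program under evaluation according to the Bayesian information criterion. However, the programs in a standard benchmark suite differ in behavioral complexity and need different numbers of simulation points to characterize their behavior accurately. SimPoint cannot recognize these differences in complexity between programs. With a fixed total number of simulation points, it therefore cannot assign more points to programs with complex behavior to reduce their evaluation error, nor fewer points to programs with simple behavior without losing evaluation accuracy. Because it does not allocate simulation points sensibly across the suite, SimPoint gives a fairly accurate average performance evaluation error, but the error for some benchmarks with complex behavior remains large. To address this, the paper improves on SimPoint's local allocation of simulation points and proposes a global greedy allocation method, the greedy point (GreedyPoint) method. The method formulates simulation point allocation as a constrained optimization problem, computes the representation error from microarchitecture-independent characteristics, and solves the problem with a global greedy algorithm. Experimental data show that, at the same simulation cost, GreedyPoint reduces the average performance evaluation error on the SPEC CPU 2017 suite from 3.23% to 2.08% and the maximum error from 21.22% to 7.01% compared with SimPoint.
Keywords: pre-silicon processor performance evaluation; representative sampling; microarchitecture-independent program characteristics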
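swDaCe: Design and Implementation of a Data-Centric Parallel Programming Model on Sunway Many-Core Processors
20
Authors: 沈沛祺, 陈俊仕, 安虹. 《小型微型计算机系统》 (PKU Core), 2026, Issue 3, pp. 751-759, 9 pages.
High-performance scientific computing is the core application area of supercomputers and includes key tasks such as particle simulation and climate analysis. However, as Moore's law gradually breaks down, supercomputer architectures are becoming increasingly heterogeneous and complex, which makes scientific computing applications harder to develop and optimize. To address this, the paper proposes and implements swDaCe, a data-centric parallel programming model for the new-generation Sunway (申威) supercomputing platform. By decoupling dataflow graph optimization from the original program, the model lets programmers describe computation logic in Python and ultimately generates high-performance C++ code adapted to the Sunway many-core architecture. The paper also proposes a set of dataflow optimizations for the Sunway architecture, including task mapping onto slave cores, vectorized parallelism, and DMA memory access optimization, to make full use of the computing power of Sunway many-core processors. Experimental results show that code generated by swDaCe achieves significant performance gains in typical applications such as sparse matrix computation, with a speedup of more than 25x on a single core group, confirming the effectiveness of the framework on the Sunway architecture.
Keywords: new-generation Sunway platform; heterogeneous many-core processor; dataflow programming; parallel computing; sparse matrix multiplication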