期刊文献+
共找到170篇文章
< 1 2 9 >
每页显示 20 50 100
Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
1
作者 邓林 窦勇 《Journal of Central South University》 SCIE EI CAS 2011年第2期490-498,共9页
Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at t... Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores. 展开更多
关键词 multi-core processor NAS parallelization CG memory optimization
在线阅读 下载PDF
Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on Multi-Core Processor
2
作者 Zhang Ziran,Li Jun,Li Changxiao(ZTE Corporation,Shenzhen 518057,P.R.China) 《ZTE Communications》 2009年第1期54-58,共5页
The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Co... The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Consequently,the single-core processor cannot meet the requirements of LTE system.This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing.The test results prove that this approach works quite well. 展开更多
关键词 CORE LTE Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on multi-core processor Design
在线阅读 下载PDF
Optimization Task Scheduling Using Cooperation Search Algorithm for Heterogeneous Cloud Computing Systems 被引量:2
3
作者 Ahmed Y.Hamed M.Kh.Elnahary +1 位作者 Faisal S.Alsubaei Hamdy H.El-Sayed 《Computers, Materials & Continua》 SCIE EI 2023年第1期2133-2148,共16页
Cloud computing has taken over the high-performance distributed computing area,and it currently provides on-demand services and resource polling over the web.As a result of constantly changing user service demand,the ... Cloud computing has taken over the high-performance distributed computing area,and it currently provides on-demand services and resource polling over the web.As a result of constantly changing user service demand,the task scheduling problem has emerged as a critical analytical topic in cloud computing.The primary goal of scheduling tasks is to distribute tasks to available processors to construct the shortest possible schedule without breaching precedence restrictions.Assignments and schedules of tasks substantially influence system operation in a heterogeneous multiprocessor system.The diverse processes inside the heuristic-based task scheduling method will result in varying makespan in the heterogeneous computing system.As a result,an intelligent scheduling algorithm should efficiently determine the priority of every subtask based on the resources necessary to lower the makespan.This research introduced a novel efficient scheduling task method in cloud computing systems based on the cooperation search algorithm to tackle an essential task and schedule a heterogeneous cloud computing problem.The basic idea of thismethod is to use the advantages of meta-heuristic algorithms to get the optimal solution.We assess our algorithm’s performance by running it through three scenarios with varying numbers of tasks.The findings demonstrate that the suggested technique beats existingmethods NewGenetic Algorithm(NGA),Genetic Algorithm(GA),Whale Optimization Algorithm(WOA),Gravitational Search Algorithm(GSA),and Hybrid Heuristic and Genetic(HHG)by 7.9%,2.1%,8.8%,7.7%,3.4%respectively according to makespan. 展开更多
关键词 heterogeneous processors cooperation search algorithm task scheduling cloud computing
在线阅读 下载PDF
Shared Cache Based on Content Addressable Memory in a Multi-Core Architecture
4
作者 Allam Abumwais Mahmoud Obaid 《Computers, Materials & Continua》 SCIE EI 2023年第3期4951-4963,共13页
Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to acc... Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors. 展开更多
关键词 multi-core processor shared cache content addressable memory dual port CAM replacement algorithm benchmark program
在线阅读 下载PDF
Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity
5
作者 Zhi-xiang CHEN Zhao-lin LI +2 位作者 Shan CAO Fang WANG Jie ZHOU 《Frontiers of Information Technology & Electronic Engineering》 SCIE EI CSCD 2015年第12期1018-1033,共16页
Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturin... Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches, which do not consider such heterogeneity caused by within-die variations,can lead to an overly pessimistic result in terms of performance. To realize an optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining(HASR) scheme by fully exploiting the heterogeneities of homogeneous multi-core processors in embedded domains.We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme,representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of cores, one of the candidate schedules is bound to the chip to maximize the performance. A set of applications are designed to evaluate the proposed scheme. Experimental results show that the proposed scheme can improve the performance by an average value of 22.2%, compared with the baseline schedule based on the worst case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves the performance by up to 12%. 展开更多
关键词 Schedule refining multi-core processor heterogenEITY Representative chip operating point
原文传递
YHFT-QDSP:High-Performance Heterogeneous Multi-Core DSP
6
作者 陈书明 万江华 +8 位作者 鲁建壮 刘仲 孙海燕 孙永节 刘衡竹 刘祥远 李振涛 徐毅 陈小文 《Journal of Computer Science & Technology》 SCIE EI CSCD 2010年第2期214-224,共11页
Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal pro... Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal processing, this paper presents a novel heterogeneous multi-core architecture digital signal processor (DSP), YHFT-QDSP, with one RISC CPU core and 4 VLIW DSP cores. By three kinds of interconnection, YHFT-QDSP provides high efficiency message communication for inner-chip RISC core and DSP cores, inner-chip and inter-chip DSP cores. A parallel programming platform is specifically developed for the heterogeneous nmlti-core architecture of YHFT-QDSP. This parallel programming environment provides a parallel support library and a friendly interface between high level application softwares and multi- core DSP. The 130 nm CMOS custom chip design results benchmarks show that the interconnection structure of in a high speed and moderate power design. The results of typical YHFT-QDSP is much better than other related structures and achieves better speedup when using the interconnection facilities in combing methods. YHFT-QDSP has been signed off and manufactured presently. The future applications of the multi-core chip could be found in 3G wireless base station, high performance radar, industrial applications, and so on. 展开更多
关键词 digital signal processor (DSP) multi-core ARCHITECTURE parallel programming custom design
原文传递
面向智能物联网异构嵌入式芯片的自适应算子并行分割方法 被引量:1
7
作者 林政 刘思聪 +2 位作者 郭斌 丁亚三 於志文 《计算机科学》 北大核心 2025年第2期299-309,共11页
随着人民生活质量的持续提升与科技发展的日新月异,智能手机等移动设备在全球范围内得到了广泛普及。在这一背景下,深度神经网络在移动端的部署与应用成为了研究的热点。深度神经网络不仅推动了移动应用领域的显著进步,同时也对使用电... 随着人民生活质量的持续提升与科技发展的日新月异,智能手机等移动设备在全球范围内得到了广泛普及。在这一背景下,深度神经网络在移动端的部署与应用成为了研究的热点。深度神经网络不仅推动了移动应用领域的显著进步,同时也对使用电池供电的移动设备的能效管理提出了更高要求。当今移动设备中异构处理器的兴起给优化能效带来了新的挑战,在不同处理器间分配计算任务以实现深度神经网络并行处理和加速,并不一定能够优化能耗,甚至可能会增加能耗。针对这一问题,提出了一种能效优化的深度神经网络自适应并行计算调度系统。该系统包括一个运行时能耗分析器与在线算子划分执行器,能够根据动态设备条件动态调整算子分配,在保持高响应性的同时,优化了移动设备异构处理器上的计算能效。实验结果证明,相比基准方法,能效优化的深度神经网络自适应并行计算调度系统在移动设备深度神经网络上的平均能耗和平均时延减少了5.19%和9.0%,最大能耗和最大时延减少了18.35%和21.6%。 展开更多
关键词 深度神经网络 移动设备 能效优化 异构处理器 能耗预测
在线阅读 下载PDF
一种新的异构多核平台下多类型DAG调度方法 被引量:1
8
作者 左俊杰 肖锋 +3 位作者 黄姝娟 沈超 郝鹏涛 陈磊 《计算机应用研究》 北大核心 2025年第2期514-518,共5页
异构多核处理器在异构环境中受限于处理器种类,只能在特定处理器上执行。现有调度方法通常使用多类型DAG(directed acyclic graph)任务模型进行模拟,但调度方法往往忽略不同核上的通信开销,或未考虑处理器与节点的对应关系,导致调度时... 异构多核处理器在异构环境中受限于处理器种类,只能在特定处理器上执行。现有调度方法通常使用多类型DAG(directed acyclic graph)任务模型进行模拟,但调度方法往往忽略不同核上的通信开销,或未考虑处理器与节点的对应关系,导致调度时间开销较大,处理器资源未充分利用,任务效率低。针对上述问题,提出了PNIF(processor-node impact factor)算法。该算法引入了两个对节点优先级具有重大影响的比例因子,将它们加入到节点优先级的计算中从而确定任务执行顺序。实验结果表明,PNIF比PEFT、HEFT、CPOP在调度长度上分别平均提升5.902%、19.402%、25.831%,有效缩短了整体调度长度,提升了处理器资源利用率。 展开更多
关键词 异构多核处理器 多类型DAG任务 任务调度 影响因子 PNIF算法
在线阅读 下载PDF
面向昇腾处理器的高性能同步原语自动插入方法
9
作者 李帅江 张馨元 +4 位作者 赵家程 田行辉 石曦予 徐晓忻 崔慧敏 《计算机研究与发展》 北大核心 2025年第8期1962-1978,共17页
指令级并行(instruction level parallism,ILP)是处理器体系结构研究的经典难题.以昇腾为代表的领域定制架构将更多的流水线细节暴露给上层软件,由编译器/程序员显式控制流水线之间的同步来优化ILP,但是流水线之间的物理同步资源是有限... 指令级并行(instruction level parallism,ILP)是处理器体系结构研究的经典难题.以昇腾为代表的领域定制架构将更多的流水线细节暴露给上层软件,由编译器/程序员显式控制流水线之间的同步来优化ILP,但是流水线之间的物理同步资源是有限的,限制了ILP的提升.针对这一问题,提出一种面向昇腾处理器的高性能同步原语自动插入方法,通过引入“虚拟同步资源”的抽象将同步原语的插入和物理同步资源的选择进行解耦.首先提出了一种启发式算法在复杂的控制流图上进行虚拟同步原语的插入,随后通过虚拟同步原语合并等技术,将虚拟同步资源映射到有限数量的物理同步资源上,并同时在满足程序正确性与严苛硬件资源限制的前提下,根据指令间的偏序关系删除程序中冗余的同步原语.使用指令级与算子级基准测试程序在昇腾910A平台上的实验表明,该方法自动插入同步原语的程序在保证正确性的基础上,整体性能与专家程序员手动插入同步原语接近或持平. 展开更多
关键词 昇腾处理器 同步原语 异构编程 领域定制架构 自动插入
在线阅读 下载PDF
Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture 被引量:13
10
作者 郑方 李宏亮 +3 位作者 吕晖 过锋 许晓红 谢向辉 《Journal of Computer Science & Technology》 SCIE EI CSCD 2015年第1期145-162,共18页
Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which h... Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing ele- ments (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and share-memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS. 展开更多
关键词 heterogeneous many-core processor data stream transfer register-level communication mechanism hardwaresynchronization technique processor prototype
原文传递
面向天河新一代超算系统的大规模精确对角化方法
11
作者 李彪 刘杰 王庆林 《计算机研究与发展》 北大核心 2025年第6期1347-1362,共16页
精确对角化(exact diagonalization)方法是一种在量子物理、凝聚态物理等领域广泛应用的数值计算方法,是最直接求得量子系统基态的数值方法.仅从哈密顿矩阵的对称性出发,利用无矩阵(matrix-free)方法、分层通信模型以及适配于MT-3000的... 精确对角化(exact diagonalization)方法是一种在量子物理、凝聚态物理等领域广泛应用的数值计算方法,是最直接求得量子系统基态的数值方法.仅从哈密顿矩阵的对称性出发,利用无矩阵(matrix-free)方法、分层通信模型以及适配于MT-3000的数据级并行算法,提出了面向天河新一代超算系统上的超大稀疏哈密顿矩阵向量乘异构并行算法,可以实现基于一维Hubbard模型的大规模精确对角化.提出的并行算法在天河新一代超算系统上进行了测试,其中在1400亿维度矩阵规模上,8192进程相比256进程强扩展效率为55.27%,而弱扩展到7300亿维度矩阵规模上,13740个进程相比64进程的弱扩展效率保持在51.25%以上. 展开更多
关键词 精确对角化 HUBBARD模型 异构并行计算 MT-3000处理器 量子多体系统
在线阅读 下载PDF
System Architecture of Godson-3 Multi-Core Processors 被引量:7
12
作者 高翔 陈云霁 +2 位作者 王焕东 唐丹 胡伟武 《Journal of Computer Science & Technology》 SCIE EI CSCD 2010年第2期181-191,共11页
Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This pa... Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem. 展开更多
关键词 multi-core processor scalable interconnection cache coherent non-uniform memory access/non-uniform cache access (CC-NUMA/NUCA) MESH CROSSBAR cache coherence reliability availability and serviceability (RAS)
原文传递
一种板级异构核间多模通信的软硬件设计方法 被引量:2
13
作者 李锐 杜彬 王远波 《汽车电器》 2025年第6期100-102,共3页
随着车联网技术的高速发展和车载电控单元复杂性的提升,传统的单处理器难以满足数据交互与处理日益复杂和多样化的需求。文章提出一种板级异构核间多模通信机制,设计集成高实时性MCU和高性能SOC的硬件平台,并对异构多模通信的硬件结构... 随着车联网技术的高速发展和车载电控单元复杂性的提升,传统的单处理器难以满足数据交互与处理日益复杂和多样化的需求。文章提出一种板级异构核间多模通信机制,设计集成高实时性MCU和高性能SOC的硬件平台,并对异构多模通信的硬件结构进行阐述。在此基础上,提出分层、低耦合、高内聚的轻量级组件化软件设计方案,阐明驱动层、接口层、网络层、协议层、传输层和应用层的通信机制。该机制在提升异构多核环境运算效率的同时,实现处理器性能的优化,提高通信传输数据的品质。 展开更多
关键词 车联网 核间通信 MCU SOC 异构处理器
在线阅读 下载PDF
Parallel computing of discrete element method on multi-core processors 被引量:6
14
作者 Yusuke Shigeto Mikio Sakai 《Particuology》 SCIE EI CAS CSCD 2011年第4期398-405,共8页
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer ... This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, DEM algorithm is shown to have high scalabilitv in a multi-thread parallel computation on a CPU. 展开更多
关键词 Discrete element method Parallel computing multi-core processor GPGPU
原文传递
Energy Efficiency of a Multi-Core Processor by Tag Reduction
15
作者 郑龙 董冕雄 +3 位作者 Kaoru Ota 金海 Song Guo 马俊 《Journal of Computer Science & Technology》 SCIE EI CSCD 2011年第3期491-503,共13页
We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This p... We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This paper extends the tag reduction technique on a single-core processor to a multi-core processor and investigates the potential of energy saving for multi-core processors. We formulate our approach as an equivalent problem which is to find an assignment of the whole instruction pages in the physical memory to a set of cores such that the tag-reduction conflicts for each core can be mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide convincing experimental results by collecting experimental data from a real operating system instead of the traditional way using a processor simulator that cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save total energy up to 83.93% on an 8-core processor and 76.16% on a 4-core processor in average compared to the one that the tag-reduction is not used for. They also significantly outperform the tag reduction based algorithm on a single-core processor. 展开更多
关键词 tag reduction multi-core processor energy efficiency
原文传递
一种异构多核系统动态调度协处理器设计
16
作者 曾树铭 倪伟 《合肥工业大学学报(自然科学版)》 北大核心 2025年第2期185-195,共11页
为研究异构多核片上系统(multi-processor system on chip,MPSoC)在密集并行计算任务中的潜力,文章设计并实现了一种适用于粗粒度数据特征、面向任务级并行应用的异构多核系统动态调度协处理器,采用了片上缓存、任务输出的多级写回管理... 为研究异构多核片上系统(multi-processor system on chip,MPSoC)在密集并行计算任务中的潜力,文章设计并实现了一种适用于粗粒度数据特征、面向任务级并行应用的异构多核系统动态调度协处理器,采用了片上缓存、任务输出的多级写回管理、任务自动映射、通讯任务乱序执行等机制。实验结果表明,该动态调度协处理器不仅能够实现任务级乱序执行等基本设计目标,还具有极低的调度开销,相较于基于动态记分牌算法的调度器,运行多个子孔径距离压缩算法的时间降低达17.13%。研究结果证明文章设计的动态调度协处理器能够有效优化目标场景下的任务调度效果。 展开更多
关键词 动态调度 硬件调度器 异构多核系统 任务级并行 编程模型 片上缓存 片上网络
在线阅读 下载PDF
面向人工智能的半导体加速单元架构设计
17
作者 孙彦德 《电子工业专用设备》 2025年第3期70-74,共5页
设计了一种适用于深度学习和大型语言模型的高效半导体加速单元架构。通过设计并行计算单元结构、建立多级片上存储体系、优化数据流传输以及实现异构系统互联与功耗管理等方法,构建了完整的加速器架构系统。实验结果表明,该架构在8 nm... 设计了一种适用于深度学习和大型语言模型的高效半导体加速单元架构。通过设计并行计算单元结构、建立多级片上存储体系、优化数据流传输以及实现异构系统互联与功耗管理等方法,构建了完整的加速器架构系统。实验结果表明,该架构在8 nm工艺下实现了3.8 TOPS/mm^(2)的计算密度和12.5 TOPS/W的功耗效率,可支持ResNet-50等典型神经网络模型的高效处理。研究证实,所提出的加速单元架构能够满足现代人工智能应用的计算需求,具有重要的实践价值。 展开更多
关键词 半导体技术 AI加速器架构 并行计算优化 神经网络处理器 片上存储系统 异构计算 功耗管理
在线阅读 下载PDF
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
18
作者 Jinying Kong Kai Nie +2 位作者 Qinglei Zhou Jinlong Xu Lin Han 《国际计算机前沿大会会议论文集》 2021年第1期180-189,共10页
The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow acce... The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow accessto thread private variables in the compilation of Sunway OpenMP programs, thispaper proposes a thread private variable access technique based on privilegedinstructions. The privileged instruction-based thread-private variable access techniquecentralizes the implementation of thread-private variables at the compilerlevel, eliminating the model switching overhead of invoking OS core processingand improving the speed of accessing thread-private variables. On the Sunway1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved 6.2% and6.8% running efficiency gains, respectively. The results show that the techniquesproposed in this paper can provide technical support for giving full play to theadvantages of Sunway’s high-performance multi-core processors. 展开更多
关键词 Sunway high-performance multi-core processors OpenMP programming technique Privileged instruction-based thread-private variable access technique Sunway 1621 processor
原文传递
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
19
作者 Kai Nie Qinglei Zhou +3 位作者 Hong Qian Jianmin Pang Jinlong Xu Yapeng Li 《国际计算机前沿大会会议论文集》 2021年第1期163-179,共17页
The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by hight... The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by highthread group control overhead in the compilation of Sunway OpenMP programs,this paper proposes the parallel region reconstruction technique. The parallelregion reconstruction technique expands the parallel scope of parallel regionsin OpenMP programs by parallel region merging and parallel region extending.Moreover, it reduces the number of parallel regions in OpenMP programs,decreases the overhead of frequent creation and convergence of thread groups,and converts standard fork-join model OpenMP programs to higher performanceSPMD modelOpenMP programs. On the Sunway 1621 server computer, NPB3.3-OMP and SPEC OMP2012 achieved 8.9% and 7.9% running efficiency improvementrespectively through parallel region reconstruction technique. As a result,the parallel region reconstruction technique is feasible and effective. It providestechnical support to fully exploit the multi-core parallelism advantage of Sunway’shigh-performance processors. 展开更多
关键词 Sunway high-performance multi-core processors OpenMP programming technique Parallel domain reconstruction technique
原文传递
基于异构多核处理器的嵌入式数控系统研究 被引量:11
20
作者 陆小虎 于东 +1 位作者 胡毅 林立明 《中国机械工程》 EI CAS CSCD 北大核心 2013年第19期2623-2628,共6页
针对传统嵌入式数控系统性能差、可扩展性差、人机界面不友好等特点,结合异构多核技术和现场总线技术的优点,提出并开发了一种基于异构处理器和现场总线技术的嵌入式数控系统。该数控系统运行在异构多核处理器之上,通过在不同的处理器... 针对传统嵌入式数控系统性能差、可扩展性差、人机界面不友好等特点,结合异构多核技术和现场总线技术的优点,提出并开发了一种基于异构处理器和现场总线技术的嵌入式数控系统。该数控系统运行在异构多核处理器之上,通过在不同的处理器核心上同时运行通用系统和实时系统,采用静态划分的方式将数控系统内部的任务分配到不同的处理器核心上,使用现场总线技术实现嵌入式数控系统与伺服电机之间的连接,简化数控系统与伺服驱动器之间的连线。实验证明,开发的数控系统具有良好的实时性和扩展性,验证了设计的合理性。 展开更多
关键词 嵌入式 数控系统 异构多核处理器 现场总线
在线阅读 下载PDF
上一页 1 2 9 下一页 到第
使用帮助 返回顶部