41 articles found
Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
1
Authors: 邓林, 窦勇. 《Journal of Central South University》 (SCIE, EI, CAS), 2011, No. 2, pp. 490-498 (9 pages)
Developing parallel applications on heterogeneous processors faces the 'memory wall' challenge, caused by the limited capacity of local storage, limited bandwidth, and long memory-access latency. To address this problem, a parallelization approach with six memory optimization schemes is proposed for CG, four of which target all kinds of sparse matrix-vector multiplication (SPMV) operations. On the IBM QS20, the approach reaches speedups of up to 21 and 133 times for sizes A and B, respectively, compared with a single Power processor element. The conclusions are that the peak memory-access bandwidth of the Cell BE can be obtained in SPMV, that simple computation is more efficient on heterogeneous processors, and that loop unrolling can hide local storage access latency when executing scalar operations on SIMD cores.
Keywords: multi-core processor NAS parallelization CG memory optimization
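For readers unfamiliar with SPMV, the operation named in the abstract above, here is a minimal sketch of sparse matrix-vector multiplication in compressed sparse row (CSR) form. It is not the paper's Cell BE implementation and applies none of its six memory optimization schemes; the function name `spmv_csr` and the sample matrix are illustrative assumptions.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  : non-zero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits row i inside `values`
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Only the stored non-zeros of this row are visited.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 example:
# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

The indirect read `x[col_idx[k]]` is the irregular memory access that makes SPMV bandwidth-bound, which is presumably why the abstract singles it out for memory optimization.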
Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on Multi-Core Processor
2
Authors: Zhang Ziran, Li Jun, Li Changxiao (ZTE Corporation, Shenzhen 518057, P.R. China). 《ZTE Communications》, 2009, No. 1, pp. 54-58 (5 pages)
The Long Term Evolution (LTE) system imposes strict requirements on dispatching delay. Moreover, the very high air interface rate of LTE demands strong processing capability from the devices that process baseband signals. Consequently, a single-core processor cannot meet the requirements of an LTE system. This paper analyzes how to use multi-core processors to parallelize uplink demodulation and decoding in LTE systems and designs a parallel processing approach. Test results show that the approach works well.
Keywords: CORE LTE Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on multi-core processor Design
Shared Cache Based on Content Addressable Memory in a Multi-Core Architecture
3
Authors: Allam Abumwais, Mahmoud Obaid. 《Computers, Materials & Continua》 (SCIE, EI), 2023, No. 3, pp. 4951-4963 (13 pages)
Modern shared-memory multi-core processors typically have shared Level 2 (L2) or Level 3 (L3) caches. Cache bottlenecks and replacement strategies are the main problems of such architectures, where multiple cores try to access the shared cache simultaneously. The main problem in improving memory performance is the shared cache architecture and cache replacement. This paper documents the implementation of a Dual-Port Content Addressable Memory (DPCAM) and a modified Near-Far Access Replacement Algorithm (NFRA), which was previously proposed as a shared L2 cache layer in a multi-core processor. Standard Performance Evaluation Corporation (SPEC) Central Processing Unit (CPU) 2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer. Results show improved performance of the multicore processor's DPCAM and NFRA algorithms, corresponding to a higher number of concurrent accesses to shared memory. The new architecture significantly increases system throughput and records performance improvements of up to 8.7% on various types of SPEC 2006 benchmarks. The miss rate is also improved by about 13%, with some exceptions in the sphinx3 and bzip2 benchmarks. These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.
Keywords: multi-core processor shared cache content addressable memory dual port CAM replacement algorithm benchmark program
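Architecture Design of a Multi-Core Stack Processor Based on Shared-Bus Interconnection
4
Authors: 陈林, 周永录, 刘宏杰, 代红兵. 《计算机应用与软件》 (PKU Core), 2025, No. 12, pp. 51-57, 70 (8 pages)
With the development of embedded systems, single-core stack processors can no longer meet the demands of real applications in development cost, execution speed, and power consumption. To improve stack processor performance and explore the value of multi-core stack processors, this paper adopts a WISHBONE shared-bus interconnection architecture and, through the design of the multi-core stack processor architecture, the Forth system instructions, bus arbitration, and a UART, builds a preliminary multi-core stack processor based on shared-bus interconnection. The processor is described structurally in Verilog and VHDL, functionally simulated with the ISim tool, and finally implemented on an FPGA chip. Experimental results show that the design uses effective bus arbitration and achieves relatively high computing performance with low hardware overhead and power consumption, laying a good foundation for further research on and application of multi-core stack processor architectures.
Keywords: Forth system; stack processor; multi-core processor; bus arbitration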
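Research and Design of a Four-Stage Pipelined Stack Processor
5
Authors: 朱恒宇, 周永录, 刘宏杰, 代红兵. 《计算机工程与设计》 (PKU Core), 2025, No. 1, pp. 265-273 (9 pages)
To address the low clock frequency of existing stack processors, a 16-bit four-stage pipelined stack processor, ZP16, is designed. It adopts the von Neumann architecture and the J1 instruction set, and has two independent stacks: a data stack and a return stack. The four pipeline stages are instruction fetch, decode, execute, and write-back. Pipeline hazards in ZP16 are resolved through careful structural design and pipeline flushing. Experimental results show that on the Xilinx XC7A100T FPGA target chip, ZP16 runs stably at a clock frequency of 230 MHz. Compared with the J1 stack processor, ZP16 achieves a pipeline speedup of 1.3 with roughly the same resource utilization, 8% higher power consumption, and a 130% higher clock frequency. Compared with other stack processors of the same type on different target chips, ZP16 shows a clear improvement in clock frequency.
Keywords: stack processor; pipeline; field-programmable gate array; clock frequency; speedup; resource utilization; power consumption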
System Architecture of Godson-3 Multi-Core Processors (Cited by: 7)
6
Authors: 高翔, 陈云霁, 王焕东, 唐丹, 胡伟武. 《Journal of Computer Science & Technology》 (SCIE, EI, CSCD), 2010, No. 2, pp. 181-191 (11 pages)
Godson-3 is the latest generation of the Godson microprocessor family. It adopts a scalable multi-core architecture with hardware support for accelerating applications, including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects, including system scalability, organization of the memory hierarchy, network-on-chip, inter-chip connection, and the I/O subsystem.
Keywords: multi-core processor scalable interconnection cache coherent non-uniform memory access/non-uniform cache access (CC-NUMA/NUCA) MESH CROSSBAR cache coherence reliability availability and serviceability (RAS)
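Research and Design of a Multi-Core Scheduling Mechanism for Multi-Core Stack Processors
7
Authors: 刘自昂, 周永录, 代红兵, 刘宏杰. 《计算机应用与软件》 (PKU Core), 2025, No. 9, pp. 263-269 (7 pages)
Multi-core stack processors are one of the research hotspots in the Forth field and have seen some progress, but they lack efficient Forth system support. Based on the characteristics of Forth multi-core stack processors, a multi-core scheduling mechanism is studied and designed. Its multi-core scheduling algorithm uses global scheduling, and its Forth task scheduling uses a variable time-slice round-robin algorithm and the EDF (Earliest Deadline First) algorithm, focusing on the Forth task scheduling problem on multi-core stack processor platforms. Experiments show that the mechanism runs reliably on an FPGA-based multi-core stack processor operating at 100 MHz and schedules tasks correctly; the minimum response time for ordinary tasks is 0.5 ms, and the longest average response time for real-time tasks is 9.36 μs.
Keywords: multi-core stack processor; Forth system; multi-core scheduling mechanism; global scheduling; variable time-slice round-robin scheduling algorithm; EDF scheduling algorithm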
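The EDF policy mentioned in the abstract above can be illustrated with a minimal sketch. It simulates preemptive EDF on a single core in unit time steps, whereas the paper targets global scheduling of Forth tasks across several cores; the function name `edf_schedule`, the tuple layout, and the sample task set are all assumptions made for illustration.

```python
import heapq


def edf_schedule(tasks):
    """Simulate preemptive Earliest Deadline First on one core.

    tasks: list of (name, release_time, deadline, exec_time) tuples,
           with integer times and exec_time >= 1.
    Returns the name of the task run in each time tick (None when idle).
    """
    pending = sorted(tasks, key=lambda t: t[1])  # ordered by release time
    ready = []      # min-heap of (deadline, name, remaining_time)
    timeline = []
    tick = 0
    nxt = 0
    while nxt < len(pending) or ready:
        # Admit every task released by the current tick.
        while nxt < len(pending) and pending[nxt][1] <= tick:
            name, _, deadline, exec_time = pending[nxt]
            heapq.heappush(ready, (deadline, name, exec_time))
            nxt += 1
        if ready:
            # Run the ready task with the earliest deadline for one tick.
            deadline, name, remaining = heapq.heappop(ready)
            timeline.append(name)
            if remaining > 1:
                heapq.heappush(ready, (deadline, name, remaining - 1))
        else:
            timeline.append(None)
        tick += 1
    return timeline


# Hypothetical task set: (name, release, deadline, execution time)
timeline = edf_schedule([("A", 0, 10, 3), ("B", 1, 4, 2), ("C", 2, 8, 1)])
```

In this task set, B arrives at tick 1 with an earlier deadline than A and preempts it; A resumes only after B and C have finished.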
Parallel computing of discrete element method on multi-core processors (Cited by: 6)
8
Authors: Yusuke Shigeto, Mikio Sakai. 《Particuology》 (SCIE, EI, CAS, CSCD), 2011, No. 4, pp. 398-405 (8 pages)
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention for accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, the DEM algorithm is shown to have high scalability in a multi-thread parallel computation on a CPU.
Keywords: Discrete element method Parallel computing multi-core processor GPGPU
Energy Efficiency of a Multi-Core Processor by Tag Reduction
9
Authors: 郑龙, 董冕雄, Kaoru Ota, 金海, Song Guo, 马俊. 《Journal of Computer Science & Technology》 (SCIE, EI, CSCD), 2011, No. 3, pp. 491-503 (13 pages)
We consider the energy saving problem for caches on a multi-core processor. Previous research on low-power processors offers various methods to reduce power dissipation, and tag reduction is one of them. This paper extends the tag reduction technique from a single-core processor to a multi-core processor and investigates the potential for energy saving on multi-core processors. We formulate our approach as an equivalent problem: find an assignment of all instruction pages in physical memory to a set of cores such that the tag-reduction conflicts for each core are mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide experimental results collected from a real operating system, instead of the traditional approach of using a processor simulator that cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save up to 83.93% of total energy on an 8-core processor and 76.16% on a 4-core processor on average, compared with the case where tag reduction is not used. They also significantly outperform the tag-reduction-based algorithm on a single-core processor.
Keywords: tag reduction multi-core processor energy efficiency
Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity
10
Authors: Zhi-xiang CHEN, Zhao-lin LI, Shan CAO, Fang WANG, Jie ZHOU. 《Frontiers of Information Technology & Electronic Engineering》 (SCIE, EI, CSCD), 2015, No. 12, pp. 1018-1033 (16 pages)
Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous downscaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches that do not consider such heterogeneity caused by within-die variations can lead to an overly pessimistic result in terms of performance. To realize optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining (HASR) scheme that fully exploits the heterogeneities of homogeneous multi-core processors in embedded domains. We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme, representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of cores, one of the candidate schedules is bound to the chip to maximize the performance. A set of applications is designed to evaluate the proposed scheme. Experimental results show that the proposed scheme can improve the performance by an average of 22.2%, compared with the baseline schedule based on worst-case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves the performance by up to 12%.
Keywords: Schedule refining multi-core processor HETEROGENEITY Representative chip operating point
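A Power-Aware Access Task Scheduling System Based on a Stack Processor
11
Authors: 梁锦来, 骆国铭. 《电子设计工程》, 2025, No. 7, pp. 172-176 (5 pages)
The power consumption of different access tasks affects the final scheduling outcome, causing the scheduled response time of access tasks to deviate from the actual one. To address this, a power-aware access task scheduling system based on a stack processor is designed. The hardware uses a stack processor based on StackProcessor-1000 and an access device based on FlashMemory-2000, which feed into a CPU-based task scheduling device, to help the system perform data access operations while scheduling and managing tasks properly and keeping the system running stably. The software computes each task's stack execution time at maximum voltage and obtains the energy the task consumes at the stack processor's maximum power. Based on the power results, it obtains the start and end times of scheduled tasks and, according to the result of the priority decision, updates the task information in the stack so that tasks are scheduled and executed correctly. Test results show that the system's task scheduling deviates from the actual scheduling by at most 1 min, improving access task scheduling.
Keywords: power awareness; stack processor; access task; scheduling system; priority decision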
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
12
Authors: Jinying Kong, Kai Nie, Qinglei Zhou, Jinlong Xu, Lin Han. 《国际计算机前沿大会会议论文集》, 2021, No. 1, pp. 180-189 (10 pages)
The primary way to achieve thread-level parallelism on the Sunway high-performance multicore processor is to use the OpenMP programming technique. To address the problem of low parallelism efficiency caused by slow access to thread private variables in the compilation of Sunway OpenMP programs, this paper proposes a thread private variable access technique based on privileged instructions. The privileged instruction-based thread-private variable access technique centralizes the implementation of thread-private variables at the compiler level, eliminating the mode switching overhead of invoking OS kernel processing and improving the speed of accessing thread-private variables. On the Sunway 1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved 6.2% and 6.8% running efficiency gains, respectively. The results show that the techniques proposed in this paper can provide technical support for giving full play to the advantages of Sunway's high-performance multi-core processors.
Keywords: Sunway high-performance multi-core processors OpenMP programming technique Privileged instruction-based thread-private variable access technique Sunway 1621 processor
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
13
Authors: Kai Nie, Qinglei Zhou, Hong Qian, Jianmin Pang, Jinlong Xu, Yapeng Li. 《国际计算机前沿大会会议论文集》, 2021, No. 1, pp. 163-179 (17 pages)
The leading way to achieve thread-level parallelism on the Sunway high-performance multicore processors is to use OpenMP programming techniques. To address the problem of low parallel efficiency caused by high thread group control overhead in the compilation of Sunway OpenMP programs, this paper proposes the parallel region reconstruction technique. The technique expands the parallel scope of parallel regions in OpenMP programs by parallel region merging and parallel region extending. Moreover, it reduces the number of parallel regions in OpenMP programs, decreases the overhead of frequent creation and convergence of thread groups, and converts standard fork-join model OpenMP programs to higher-performance SPMD model OpenMP programs. On the Sunway 1621 server computer, NPB3.3-OMP and SPEC OMP2012 achieved 8.9% and 7.9% running efficiency improvements, respectively, through the parallel region reconstruction technique. As a result, the parallel region reconstruction technique is feasible and effective. It provides technical support to fully exploit the multi-core parallelism advantage of Sunway's high-performance processors.
Keywords: Sunway high-performance multi-core processors OpenMP programming technique Parallel domain reconstruction technique
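Design and Implementation of an Embedded Ethernet Interface Based on the DM9000A (Cited by: 26)
14
Authors: 施勇, 温阳东. 《合肥工业大学学报(自然科学版)》 (CAS, CSCD, PKU Core), 2011, No. 4, pp. 519-524 (6 pages)
This paper proposes a design method for an embedded Ethernet interface based on the 32-bit ARM processor LPC2468 and the Ethernet controller DM9000A. The hardware work mainly concerns the design of the Ethernet network interface circuit; the software work mainly concerns the Ethernet controller driver and the upper-layer network protocols. This network access scheme for embedded systems features a simple hardware interface, few peripheral components, low cost, and a short development cycle.
Keywords: embedded system; LPC2468 processor; DM9000A controller; network driver; TCP/IP network protocol stack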
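Design and Implementation of an Ethernet Real-Time Data Acquisition System Based on SOPC (Cited by: 5)
15
Authors: 梅大成, 柴志勇. 《计算机应用》 (CSCD, PKU Core), 2009, No. B12, pp. 108-109, 112 (3 pages)
A real-time data acquisition system based on SOPC technology is designed. The system uses the Nios II soft-core processor as the main controller, the embedded real-time operating system μC/OS-II as the software platform, and LWIP as the Ethernet communication protocol, implementing Ethernet transmission and control for the data acquisition system. The whole system is implemented and verified on a Cyclone II EP2C35 development board.
Keywords: Nios II soft-core processor; SOPC; μC/OS-II; LWIP protocol stack; real-time data acquisition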
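A Study of Branch Instruction Characteristics and Branch Predictor Performance (Cited by: 1)
16
Authors: 喻明艳, 张祥建, 王晨旭. 《微电子学与计算机》 (CSCD, PKU Core), 2010, No. 6, pp. 8-12 (5 pages)
Based on the characteristics of branch instructions, this paper analyzes how branch behavior and branch prediction techniques affect the composition of CPI stacks in single-issue embedded processors, designs a timing-accurate model of the branch predictor at the RTL level, and studies branch instruction characteristics and branch predictor performance through hardware simulation. The experiments examine the different jump statistics of branch instructions on branch predictor hits and misses, verify the theoretical derivation of the branch predictor's effect on CPI stacks, and provide accurate experimental evidence for designing and optimizing branch predictors in single-issue embedded processors.
Keywords: CPI stack; branch predictor; single-issue embedded processor; hardware model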
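Design and Implementation of the Protocol Stack Driver Module for an NP Firewall (Cited by: 1)
17
Authors: 韩志耕, 罗军舟. 《计算机工程》 (EI, CAS, CSCD, PKU Core), 2006, No. 21, pp. 136-138 (3 pages)
Fully opening the path from a network processor's optical ports to the local protocol stack requires support from a protocol stack driver. Considering the basic components and internal driving mechanism of the protocol stack driver, and following the layered design principles of the Intel IXA software architecture, this paper proposes and analyzes an implementation on the Linux platform and identifies the key techniques involved. Tests of the hardware optical-port path on the Enp2611 evaluation board show that the design meets the intended requirements.
Keywords: protocol stack driver; firewall; network processor; packet classification; proactive security protection system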
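Design and Implementation of Source-Address Routing in NP-Based Policy Routing (Cited by: 2)
18
Author: 易著梁. 《广西民族大学学报(自然科学版)》 (CAS), 2013, No. 3, pp. 64-67 (4 pages)
A source-address routing solution based on a network processor is described. The solution transparently provides high-capacity packet forwarding without reducing the payload efficiency of IP packets, and is an effective approach.
Keywords: source-address routing; network processor; IP protocol stack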
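Techniques for Accessing DSP Program Memory Space in a C Environment (Cited by: 2)
19
Authors: 易龙强, 戴瑜兴. 《湖南工程学院学报(自然科学版)》, 2006, No. 4, pp. 1-3, 19 (4 pages)
The C compiler for the TMS320C2xx family of DSPs provides no C run-time library functions for operating on data in program memory; this paper presents a solution to that problem. By introducing the assembly instructions used to implement the functions, the software stack structure of TI's C compilation environment, and the C calling convention, it describes in detail how to implement C-callable access to DSP program memory space. The technique can be used in engineering applications with large amounts of constant data to relieve the shortage of data memory. It can also reserve a region of program memory space as non-volatile storage for user data that must survive power loss, which helps simplify the system and improve its performance. Practice shows that the technique has high practical value.
Keywords: DSP; C compiler; stack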
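Design and Implementation of a Novel IPv6 Forwarding System Based on Network Processors
20
Authors: 苏金树, 时向泉, 吴纯青. 《国防科技大学学报》 (EI, CAS, CSCD, PKU Core), 2005, No. 5, pp. 6-11 (6 pages)
The introduction of the forwarding and control separation architecture and the development of network processors have had an important impact on the scalability, flexibility, and performance of routers, and IPv6, as the core of the next-generation Internet protocols, is an important subject of router research. This paper briefly describes the system architecture of an IPv6 router based on the forwarding and control separation architecture ForCES, focuses on the forwarding structure of the network-processor-based IPv6 router, the flow design of the dual-stack forwarding system, and the design and implementation of the tunneling mechanism, and gives actual test results for the IPv6 router prototype system.
Keywords: IPv6; forwarding and control separation; network processor; dual stack; tunnel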