期刊文献+
共找到112篇文章
< 1 2 6 >
每页显示 20 50 100
Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
1
作者 邓林 窦勇 《Journal of Central South University》 SCIE EI CAS 2011年第2期490-498,共9页
Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at t... Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores. 展开更多
关键词 multi-core processor NAS parallelization CG memory optimization
在线阅读 下载PDF
Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on Multi-Core Processor
2
作者 Zhang Ziran,Li Jun,Li Changxiao(ZTE Corporation,Shenzhen 518057,P.R.China) 《ZTE Communications》 2009年第1期54-58,共5页
The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Co... The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Consequently,the single-core processor cannot meet the requirements of LTE system.This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing.The test results prove that this approach works quite well. 展开更多
关键词 CORE LTE Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on multi-core processor Design
在线阅读 下载PDF
A VLIW Architecture Stream Cryptographic Processor for Information Security 被引量:4
3
作者 Longmei Nan Xuan Yang +4 位作者 Xiaoyang Zeng Wei Li Yiran Du Zibin Dai Lin Chen 《China Communications》 SCIE CSCD 2019年第6期185-199,共15页
As an important branch of information security algorithms,the efficient and flexible implementation of stream ciphers is vital.Existing implementation methods,such as FPGA,GPP and ASIC,provide a good support,but they ... As an important branch of information security algorithms,the efficient and flexible implementation of stream ciphers is vital.Existing implementation methods,such as FPGA,GPP and ASIC,provide a good support,but they could not achieve a better tradeoff between high speed processing and high flexibility.ASIC has fast processing speed,but its flexibility is poor,GPP has high flexibility,but the processing speed is slow,FPGA has high flexibility and processing speed,but the resource utilization is very low.This paper studies a stream cryptographic processor which can efficiently and flexibly implement a variety of stream cipher algorithms.By analyzing the structure model,processing characteristics and storage characteristics of stream ciphers,a reconfigurable stream cryptographic processor with special instructions based on VLIW is presented,which has separate/cluster storage structure and is oriented to stream cipher operations.The proposed instruction structure can effectively support stream cipher processing with multiple data bit widths,parallelism among stream cipher processing with different data bit widths,and parallelism among branch control and stream cipher processing with high instruction level parallelism;the designed separate/clustered special bit registers and general register heaps,key register heaps can satisfy cryptographic requirements.So the proposed processor not only flexibly accomplishes the combination of multiple basic stream cipher operations to finish stream cipher algorithms.It has been implemented with 0.18μm CMOS technology,the test results show that the frequency can reach 200 MHz,and power consumption is 310 mw.Ten kinds of stream ciphers were realized in the processor.The key stream generation throughput of Grain-80,W7,MICKEY,ACHTERBAHN and Shrink algorithm is 100 Mbps,66.67 Mbps,66.67 Mbps,50 Mbps and 800 Mbps,respectively.The test result shows that the processor presented can achieve good tradeoff between high performance and flexibility of stream ciphers. 展开更多
关键词 stream CIPHER VLIW architecture processor RECONFIGURABLE application-specific instruction-set
在线阅读 下载PDF
Exploiting Loop-Carried Stream Reuse for Scientific Computing Applications on the Stream Processor
4
作者 Weixia XU Qiang DOU +2 位作者 Ying ZHANG Gen LI Xuejun YANG 《International Journal of Communications, Network and System Sciences》 2010年第1期32-37,共6页
Compared with other stream applications, scientific stream programs are usually bound by memory accesses. Reusing streams across different iterations, i.e. loop-carried stream reuse, can effectively improve the SRF lo... Compared with other stream applications, scientific stream programs are usually bound by memory accesses. Reusing streams across different iterations, i.e. loop-carried stream reuse, can effectively improve the SRF locality, thus reducing memory accesses greatly. In the paper, we first present the algorism identifying loop-carried stream reuse and that exploiting the reuse after analyzing scientific computing applications. We then perform several representative microbenchmarks and scientific stream programs with and without our optimization on Isim, a cycle-accurate stream processor simulator. Experimental results show that our algorithms can effectively exploit loop-carried stream reuse for scientific stream programs and thus greatly improve the performance of memory-bound scientific stream programs. 展开更多
关键词 stream REUSE Loop-Carried stream processor
暂未订购
Shared Cache Based on Content Addressable Memory in a Multi-Core Architecture
5
作者 Allam Abumwais Mahmoud Obaid 《Computers, Materials & Continua》 SCIE EI 2023年第3期4951-4963,共13页
Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to acc... Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors. 展开更多
关键词 multi-core processor shared cache content addressable memory dual port CAM replacement algorithm benchmark program
在线阅读 下载PDF
StreamJacobi: Efficient implementation of 2-D Jacobi on a stream processor
6
作者 ZHANG Ying LI Gen YANG Xue-jun ZHANG Wei FANG Xu-dong ZHANG He-ying 《通讯和计算机(中英文版)》 2009年第1期6-15,共10页
关键词 软件 程序设计 编程语言 处理器
在线阅读 下载PDF
面向FT1000微处理器的STREAM并行计算与优化 被引量:4
7
作者 迟利华 胡庆丰 +3 位作者 刘杰 甘新标 蒋杰 晏益慧 《计算机工程与科学》 CSCD 北大核心 2014年第12期2267-2271,共5页
STREAM是微处理器上内存性能的基准测试程序,在多核多线程FT1000微处理器上发挥高性能是具有挑战性的研究工作。基于多级Cache结构,优化STREAM四个程序的指令流水线,根据寄存器数,设计了多级循环展开方法,根据指令延迟和Cache行的大小... STREAM是微处理器上内存性能的基准测试程序,在多核多线程FT1000微处理器上发挥高性能是具有挑战性的研究工作。基于多级Cache结构,优化STREAM四个程序的指令流水线,根据寄存器数,设计了多级循环展开方法,根据指令延迟和Cache行的大小确定数据预取的数目,使用汇编语言编写了优化子程序。基于OpenMP并行环境,设计了STREAM并行程序,优化了局部化数据分配方式。数据测试结果表明,优化后的STREAM的性能比原始串行程序性能提高了19.2%-64.2%。优化后,并行程序的最高访存性能达到8.5GB/s,对比优化前的最高访存性能最大提高了22.7%。 展开更多
关键词 多线程微处理器 stream测试程序 性能优化
在线阅读 下载PDF
龙腾Stream流处理器验证 被引量:1
8
作者 白龙飞 樊晓桠 +1 位作者 张萌 孙立超 《计算机工程与应用》 CSCD 2013年第15期65-69,共5页
芯片设计复杂度的提高迫切地需要先进的方法学以应对巨大的验证工作量。通过开发基于System Verilog的覆盖率驱动的自动化验证平台,对龙腾Stream流处理器的指令集进行了功能验证。实验结果表明,该验证平台提高了验证效率和功能覆盖率,... 芯片设计复杂度的提高迫切地需要先进的方法学以应对巨大的验证工作量。通过开发基于System Verilog的覆盖率驱动的自动化验证平台,对龙腾Stream流处理器的指令集进行了功能验证。实验结果表明,该验证平台提高了验证效率和功能覆盖率,具有良好的重用性和可移植性。搭建FPGA原型验证系统对流处理器的功能和系统性能进行了评测,并提出了优化流处理器加速性能的方法。 展开更多
关键词 流处理器 指令集验证 System VERILOG 现场可编程门阵列(FPGA)原型验证
在线阅读 下载PDF
System Architecture of Godson-3 Multi-Core Processors 被引量:7
9
作者 高翔 陈云霁 +2 位作者 王焕东 唐丹 胡伟武 《Journal of Computer Science & Technology》 SCIE EI CSCD 2010年第2期181-191,共11页
Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This pa... Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem. 展开更多
关键词 multi-core processor scalable interconnection cache coherent non-uniform memory access/non-uniform cache access (CC-NUMA/NUCA) MESH CROSSBAR cache coherence reliability availability and serviceability (RAS)
原文传递
Parallel computing of discrete element method on multi-core processors 被引量:6
10
作者 Yusuke Shigeto Mikio Sakai 《Particuology》 SCIE EI CAS CSCD 2011年第4期398-405,共8页
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer ... This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, DEM algorithm is shown to have high scalabilitv in a multi-thread parallel computation on a CPU. 展开更多
关键词 Discrete element method Parallel computing multi-core processor GPGPU
原文传递
Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity
11
作者 Zhi-xiang CHEN Zhao-lin LI +2 位作者 Shan CAO Fang WANG Jie ZHOU 《Frontiers of Information Technology & Electronic Engineering》 SCIE EI CSCD 2015年第12期1018-1033,共16页
Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturin... Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches, which do not consider such heterogeneity caused by within-die variations,can lead to an overly pessimistic result in terms of performance. To realize an optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining(HASR) scheme by fully exploiting the heterogeneities of homogeneous multi-core processors in embedded domains.We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme,representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of cores, one of the candidate schedules is bound to the chip to maximize the performance. A set of applications are designed to evaluate the proposed scheme. Experimental results show that the proposed scheme can improve the performance by an average value of 22.2%, compared with the baseline schedule based on the worst case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves the performance by up to 12%. 展开更多
关键词 Schedule refining multi-core processor HETEROGENEITY Representative chip operating point
原文传递
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
12
作者 Jinying Kong Kai Nie +2 位作者 Qinglei Zhou Jinlong Xu Lin Han 《国际计算机前沿大会会议论文集》 2021年第1期180-189,共10页
The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow acce... The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow accessto thread private variables in the compilation of Sunway OpenMP programs, thispaper proposes a thread private variable access technique based on privilegedinstructions. The privileged instruction-based thread-private variable access techniquecentralizes the implementation of thread-private variables at the compilerlevel, eliminating the model switching overhead of invoking OS core processingand improving the speed of accessing thread-private variables. On the Sunway1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved 6.2% and6.8% running efficiency gains, respectively. The results show that the techniquesproposed in this paper can provide technical support for giving full play to theadvantages of Sunway’s high-performance multi-core processors. 展开更多
关键词 Sunway high-performance multi-core processors OpenMP programming technique Privileged instruction-based thread-private variable access technique Sunway 1621 processor
原文传递
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
13
作者 Kai Nie Qinglei Zhou +3 位作者 Hong Qian Jianmin Pang Jinlong Xu Yapeng Li 《国际计算机前沿大会会议论文集》 2021年第1期163-179,共17页
The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by hight... The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by highthread group control overhead in the compilation of Sunway OpenMP programs,this paper proposes the parallel region reconstruction technique. The parallelregion reconstruction technique expands the parallel scope of parallel regionsin OpenMP programs by parallel region merging and parallel region extending.Moreover, it reduces the number of parallel regions in OpenMP programs,decreases the overhead of frequent creation and convergence of thread groups,and converts standard fork-join model OpenMP programs to higher performanceSPMD modelOpenMP programs. On the Sunway 1621 server computer, NPB3.3-OMP and SPEC OMP2012 achieved 8.9% and 7.9% running efficiency improvementrespectively through parallel region reconstruction technique. As a result,the parallel region reconstruction technique is feasible and effective. It providestechnical support to fully exploit the multi-core parallelism advantage of Sunway’shigh-performance processors. 展开更多
关键词 Sunway high-performance multi-core processors OpenMP programming technique Parallel domain reconstruction technique
原文传递
x86处理器向量条件访存指令安全脆弱性分析
14
作者 李丹萍 朱子元 +1 位作者 史岗 孟丹 《计算机学报》 EI CAS CSCD 北大核心 2024年第3期525-543,共19页
单指令多数据流(Single Instruction stream,Multiple Data streams,SIMD)是一种利用数据级并行提高处理器性能的技术,旨在利用多个处理器并行执行同一条指令增加数据处理的吞吐量.随着大数据、人工智能等技术的兴起,人们对数据并行化... 单指令多数据流(Single Instruction stream,Multiple Data streams,SIMD)是一种利用数据级并行提高处理器性能的技术,旨在利用多个处理器并行执行同一条指令增加数据处理的吞吐量.随着大数据、人工智能等技术的兴起,人们对数据并行化处理的需求不断提高,这使得SIMD技术愈发重要.为了支持SIMD技术,Intel和AMD等x86处理器厂商从1996年开始在其处理器中陆续引入了MMX(MultiMedia Extensions)、SSE(Streaming SIMD Extensions)、AVX(Advanced Vector eXtensions)等SIMD指令集扩展.通过调用SIMD指令,程序员能够无需理解SIMD技术的硬件层实现细节就方便地使用它的功能.然而,随着熔断、幽灵等处理器硬件漏洞的发现,人们逐渐认识到并行优化技术是一柄双刃剑,它在提高性能的同时也能带来安全风险.本文聚焦于x86 SIMD指令集扩展中的VMASKMOV指令,对它的安全脆弱性进行了分析.本文的主要贡献如下:(1)利用时间戳计数器等技术对VMASKMOV指令进行了微架构逆向工程,首次发现VMASKMOV指令与内存页管理和CPU Fill Buffer等安全风险的相关性;(2)披露了一个新的处理器漏洞EvilMask,它广泛存在于Intel和AMD处理器上,并提出了3个EvilMask攻击原语:VMASKMOVL+Time(MAP)、VMASKMOVS+Time(XD)和VMASKMOVL+MDS,可用于实施去地址空间布局随机化攻击和进程数据窃取攻击;(3)给出了2个EvilMask概念验证示例(Proof-of-Concept,PoC)验证了EvilMask对真实世界的信息安全危害;(4)讨论了针对EvilMask的防御方案,指出最根本的解决方法是在硬件层面上重新实现VMASKMOV指令,并给出了初步的实现方案. 展开更多
关键词 处理器安全 单指令多数据流(SIMD) 微体系结构侧信道攻击 VMASKMOV指令 地址空间布局随机化(ASLR)
在线阅读 下载PDF
Energy Efficiency of a Multi-Core Processor by Tag Reduction
15
作者 郑龙 董冕雄 +3 位作者 Kaoru Ota 金海 Song Guo 马俊 《Journal of Computer Science & Technology》 SCIE EI CSCD 2011年第3期491-503,共13页
We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This p... We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This paper extends the tag reduction technique on a single-core processor to a multi-core processor and investigates the potential of energy saving for multi-core processors. We formulate our approach as an equivalent problem which is to find an assignment of the whole instruction pages in the physical memory to a set of cores such that the tag-reduction conflicts for each core can be mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide convincing experimental results by collecting experimental data from a real operating system instead of the traditional way using a processor simulator that cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save total energy up to 83.93% on an 8-core processor and 76.16% on a 4-core processor in average compared to the one that the tag-reduction is not used for. They also significantly outperform the tag reduction based algorithm on a single-core processor. 展开更多
关键词 tag reduction multi-core processor energy efficiency
原文传递
Imagine流处理器上流的优化组织方法 被引量:4
16
作者 杨学军 曾丽芳 +1 位作者 邓宇 唐玉华 《计算机学报》 EI CSCD 北大核心 2008年第7期1092-1100,共9页
流应用的特点以及传统处理器在处理流应用上的不足,使得支持数据并行的流处理器的设计成为当前体系结构研究领域的一个热点.文中针对Imagine流处理器体系结构的特点,提出了流分割和流压缩两种流的优化组织方法.模拟结果表明,流分割和流... 流应用的特点以及传统处理器在处理流应用上的不足,使得支持数据并行的流处理器的设计成为当前体系结构研究领域的一个热点.文中针对Imagine流处理器体系结构的特点,提出了流分割和流压缩两种流的优化组织方法.模拟结果表明,流分割和流压缩使得流应用程序能充分利用Imagine的并行结构、流水结构和多级带宽存储结构,从而减少流程序的执行时间. 展开更多
关键词 Imagine流处理器 流应用 流优化 流分割 流压缩
在线阅读 下载PDF
基于图形处理器的数据流快速聚类 被引量:24
17
作者 曹锋 周傲英 《软件学报》 EI CSCD 北大核心 2007年第2期291-302,共12页
在数据流环境下,聚类算法不仅需要有较高的聚类质量,同时需要有实时处理速度.因而,提出了一类基于图形处理器(graphics processing unit,简称GPU)的快速聚类方法,包括基于K-means的基本聚类方法、基于GPU的数据流聚类以及数据流簇进化... 在数据流环境下,聚类算法不仅需要有较高的聚类质量,同时需要有实时处理速度.因而,提出了一类基于图形处理器(graphics processing unit,简称GPU)的快速聚类方法,包括基于K-means的基本聚类方法、基于GPU的数据流聚类以及数据流簇进化分析方法.这些方法的共同特点是充分利用了GPU强大的处理能力和流水线特性.与以往具有独立框架的数据流聚类算法不同,这些基于GPU的聚类算法具有同一框架和多种聚类分析功能,为数据流聚类分析提供了统一的平台.从分析可知,数据流聚类分析的核心操作实际上就是距离计算和比较.基于这一认识,利用GPU的子素向量处理功能进行距离计算.性能验证实验是在配有Pentium IV3.4G CPU和NVIDIA GeForce 6800 GT显卡的PC上进行的.综合分析和实验结果表明,基于GPU的数据流聚类算法比传统的CPU算法平均快7倍,从而为高速数据流应用提供了良好的支持. 展开更多
关键词 数据流 聚类 图形处理器 进化 窗口
在线阅读 下载PDF
一种流处理器体系结构MASA及其在流体力学计算中的评测 被引量:3
18
作者 伍楠 文梅 +4 位作者 何义 荀长庆 任巨 柴俊 张春元 《计算机学报》 EI CSCD 北大核心 2008年第1期133-141,共9页
提出了面向科学计算的64位流体系结构——MASA,它具有强局域性、并行性、解耦合访存操作和计算操作等特征,特别适合于计算密集型的并行应用.作者使用时钟精确的模拟器评测了流体力学中的典型应用在MASA上的运行性能,结果表明MASA在500MH... 提出了面向科学计算的64位流体系结构——MASA,它具有强局域性、并行性、解耦合访存操作和计算操作等特征,特别适合于计算密集型的并行应用.作者使用时钟精确的模拟器评测了流体力学中的典型应用在MASA上的运行性能,结果表明MASA在500MHz的情况下能够获得比1.6GHz的Iantium2近4倍的加速,证实了流体系结构在高性能计算领域的极大潜力. 展开更多
关键词 流处理器 体系结构 科学计算 Ygx2 MASA
在线阅读 下载PDF
可配置流处理器核心级指令设计及相关编译技术研究 被引量:4
19
作者 何义 任巨 +3 位作者 杨乾明 管茂林 文梅 张春元 《计算机工程与科学》 CSCD 北大核心 2009年第11期40-44,共5页
针对目前微处理器面对通用性、高性能、功耗效率的矛盾,我们提出了可配置流处理器的解决方案。本文重点研究了可配置流处理器中核心级指令设计及相关的编译技术,其核心设计思想是根据应用的计算特征设计流处理器中的核心级指令集,从而... 针对目前微处理器面对通用性、高性能、功耗效率的矛盾,我们提出了可配置流处理器的解决方案。本文重点研究了可配置流处理器中核心级指令设计及相关的编译技术,其核心设计思想是根据应用的计算特征设计流处理器中的核心级指令集,从而降低指令集硬件资源的需求。 展开更多
关键词 可配置 指令集 流处理器 编译
在线阅读 下载PDF
基于流体系架构的分组密码处理器设计 被引量:2
20
作者 李功丽 戴紫彬 +3 位作者 徐进辉 王寿成 朱玉飞 冯晓 《计算机研究与发展》 EI CSCD 北大核心 2017年第12期2824-2833,共10页
为提升密码处理器性能,构建了密码处理器性能模型.基于该模型,提出多级资源共享、绑定前/后异或操作、最大化算法并行度等处理器性能提升技术,并根据性能提升技术确定了功能单元的种类和数量.然而功能单元不仅数量较多,而且在操作位宽... 为提升密码处理器性能,构建了密码处理器性能模型.基于该模型,提出多级资源共享、绑定前/后异或操作、最大化算法并行度等处理器性能提升技术,并根据性能提升技术确定了功能单元的种类和数量.然而功能单元不仅数量较多,而且在操作位宽和操作延迟方面均有较大差异,如何有效组织这些功能单元成为了一个关键问题.利用流体系结构可以高效集成大量功能单元的特点,设计并实现了基于流体系结构的可重构分组密码处理器原型,并通过把功能单元划分为基本处理单元,bank间共享单元和簇间共享单元3个层次来解决功能单元处理位宽和操作延迟的差异.在65nm CMOS工艺下对处理器原型进行综合,并在该结构上映射了典型的分组密码算法.实验结果证明:该处理器以较小的面积获得了较高的性能,对典型分组密码算法的处理速度,不仅超越了国际上的密码专用指令处理器,而且高于国内可重构阵列结构密码处理器. 展开更多
关键词 分组密码 流处理器 性能模型 可重构 密码处理器
在线阅读 下载PDF
上一页 1 2 6 下一页 到第
使用帮助 返回顶部