17 articles found
Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
1
Authors: 邓林, 窦勇. Source: Journal of Central South University (SCIE, EI, CAS), 2011, Issue 2, pp. 490-498 (9 pages).
Developing parallel applications on heterogeneous processors faces the 'memory wall' challenge, caused by the limited capacity of local storage, limited bandwidth, and long memory access latency. To address this problem, a parallelization approach with six memory optimization schemes is proposed for CG, four of which target all kinds of sparse matrix-vector multiplication (SPMV) operations. On the IBM QS20, the approach reaches speedups of up to 21 and 133 times for sizes A and B, respectively, compared with a single Power Processor Element. The paper concludes that the peak memory access bandwidth of the Cell BE can be obtained in SPMV, that simple computation is more efficient on heterogeneous processors, and that loop unrolling can hide local storage access latency when executing scalar operations on SIMD cores.
Keywords: multi-core processor; NAS parallelization; CG; memory optimization
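The SPMV kernel that dominates CG in the abstract above can be illustrated with a minimal sketch. It assumes the compressed sparse row (CSR) layout, which the abstract does not name; the function `spmv_csr` and the 3x3 matrix are hypothetical and are not taken from the paper. The indirect read `x[col_idx[k]]` is the irregular memory access that makes this kernel bandwidth-bound.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form (assumed layout)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect, irregular read of x
        y[row] = acc
    return y


# Hypothetical example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
print(y)
```

Each row performs only one multiply-add per nonzero, so performance depends mostly on how fast `values`, `col_idx`, and `x` can be streamed from memory, which is consistent with the abstract's focus on memory optimization.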
Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on Multi-Core Processor
2
Authors: Zhang Ziran, Li Jun, Li Changxiao (ZTE Corporation, Shenzhen 518057, P.R. China). Source: ZTE Communications, 2009, Issue 1, pp. 54-58 (5 pages).
The Long Term Evolution (LTE) system imposes strict requirements on dispatching delay. Moreover, the very high air interface rate of LTE requires strong processing capability in the devices that process baseband signals. Consequently, a single-core processor cannot meet the requirements of an LTE system. This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs a parallel processing approach. Test results indicate that the approach works well.
Keywords: CORE; LTE; Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on multi-core processor; Design
Shared Cache Based on Content Addressable Memory in a Multi-Core Architecture
3
Authors: Allam Abumwais, Mahmoud Obaid. Source: Computers, Materials & Continua (SCIE, EI), 2023, Issue 3, pp. 4951-4963 (13 pages).
Modern shared-memory multi-core processors typically have shared Level 2 (L2) or Level 3 (L3) caches. Cache bottlenecks and replacement strategies are the main problems of such architectures, where multiple cores try to access the shared cache simultaneously. The main problem in improving memory performance is the shared cache architecture and cache replacement. This paper documents the implementation of a Dual-Port Content Addressable Memory (DPCAM) and a modified Near-Far Access Replacement Algorithm (NFRA), which was previously proposed as a shared L2 cache layer in a multi-core processor. Standard Performance Evaluation Corporation (SPEC) Central Processing Unit (CPU) 2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer. Results show improved performance of the multicore processor's DPCAM and NFRA algorithms, corresponding to a higher number of concurrent accesses to shared memory. The new architecture significantly increases system throughput and records performance improvements of up to 8.7% on various types of SPEC 2006 benchmarks. The miss rate is also improved by about 13%, with some exceptions in the sphinx3 and bzip2 benchmarks. These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.
Keywords: multi-core processor; shared cache; content addressable memory; dual port CAM; replacement algorithm; benchmark program
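Efficient Implementation of the Chinese National Cryptographic Algorithms SM2, SM3, and SM4 Based on RISC-V Instruction Extension (Cited by: 6)
4
Authors: 王明登, 严迎建, 郭朋飞, 张帆. Source: 电子学报 (EI, CAS, CSCD, PKU Core), 2024, Issue 8, pp. 2850-2865 (16 pages).
Implementing cryptographic algorithms through instruction extension is a lightweight approach that balances performance and area, and it is especially suited to increasingly widespread Internet of Things devices. Chinese national cryptographic algorithms such as SM2, SM3, and SM4 help improve the security of independently controllable devices, but research on instruction extensions for these algorithms remains insufficient. Because it is open source, simple, and extensible, RISC-V has become one of the most popular instruction set architectures in industry. This paper extends the instruction set of a domestic open-source RISC-V processor to implement SM2, SM3, and SM4 efficiently. An overall instruction extension scheme is proposed based on hardware-software co-design. After in-depth analysis of the algorithms and a comparison of alternative schemes, dedicated hardware units are designed and efficient implementations are proposed. The resulting coprocessor has a two-stage pipeline; an instruction execution mode with in-order dispatch, out-of-order execution, and in-order write-back; an independent memory access unit; and wide registers. The coprocessor takes over part of the control logic of the cryptographic algorithms in a unified way, reducing hardware resource consumption. Experimental results show that the coprocessor has a compact hardware structure and high resource utilization. The SM2, SM3, and SM4 implementations occupy few resources, although their execution speed is somewhat lower than that of pure hardware implementations; the area-time product shows varying degrees of advantage over related work.
Keywords: RISC-V; coprocessor; Chinese national cryptographic algorithms; instruction extension; Hummingbird E203; embedded system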
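Design of a Fully Domestic Distribution Terminal Based on the Sylix Operating System (Cited by: 2)
5
Authors: 黄亮亮, 王思麒, 徐鼎, 温彦军, 乔莉, 陈建磊. Source: 电子器件, 2024, Issue 6, pp. 1563-1569 (7 pages).
Distribution terminals are key devices for protecting the safe and stable operation of distribution networks. To break the foreign monopoly on key core technologies, fully domestic distribution terminals have been put on the agenda. The software and hardware technologies required for domestic applications have gradually matured, and this paper studies a distribution terminal built on them. On the hardware side, based on a domestic multi-core processor, the design includes a dual-channel analog sampling circuit using a domestic sampling chip, an ECC check circuit for domestic memory, and an anti-interference regulated power supply circuit. On the software side, based on the domestic Sylix real-time operating system, a multi-core software platform architecture and an inter-core data sharing mechanism are designed. Terminal accuracy tests, data file tests, and electromagnetic compatibility tests verify the functions and performance of the terminal.
Keywords: fully domestic distribution terminal; Sylix operating system; domestic multi-core processor; inter-core shared data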
System Architecture of Godson-3 Multi-Core Processors (Cited by: 7)
6
Authors: 高翔, 陈云霁, 王焕东, 唐丹, 胡伟武. Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2010, Issue 2, pp. 181-191 (11 pages).
Godson-3 is the latest generation of the Godson microprocessor family. It adopts a scalable multi-core architecture with hardware support for accelerating applications, including x86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from several aspects, including system scalability, organization of the memory hierarchy, network-on-chip, inter-chip connection, and the I/O subsystem.
Keywords: multi-core processor; scalable interconnection; cache coherent non-uniform memory access/non-uniform cache access (CC-NUMA/NUCA); mesh; crossbar; cache coherence; reliability, availability and serviceability (RAS)
Parallel computing of discrete element method on multi-core processors (Cited by: 6)
7
Authors: Yusuke Shigeto, Mikio Sakai. Source: Particuology (SCIE, EI, CAS, CSCD), 2011, Issue 4, pp. 398-405 (8 pages).
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention for accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, the DEM algorithm is shown to have high scalability in multi-thread parallel computation on a CPU.
Keywords: discrete element method; parallel computing; multi-core processor; GPGPU
Energy Efficiency of a Multi-Core Processor by Tag Reduction
8
Authors: 郑龙, 董冕雄, Kaoru Ota, 金海, Song Guo, 马俊. Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2011, Issue 3, pp. 491-503 (13 pages).
We consider the energy saving problem for caches on a multi-core processor. Previous research on low-power processors offers various methods to reduce power dissipation, and tag reduction is one of them. This paper extends the tag reduction technique from a single-core processor to a multi-core processor and investigates the potential for energy saving on multi-core processors. We formulate our approach as an equivalent problem: find an assignment of all instruction pages in physical memory to a set of cores such that the tag-reduction conflicts for each core are mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We collect experimental data from a real operating system instead of the traditional processor simulator, which cannot simulate operating system functions and the full memory hierarchy. Experimental results show that the proposed algorithms can save up to 83.93% of total energy on an 8-core processor and 76.16% on a 4-core processor on average, compared with a configuration that does not use tag reduction. They also significantly outperform the tag-reduction-based algorithm on a single-core processor.
Keywords: tag reduction; multi-core processor; energy efficiency
Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity
9
Authors: Zhi-xiang CHEN, Zhao-lin LI, Shan CAO, Fang WANG, Jie ZHOU. Source: Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2015, Issue 12, pp. 1018-1033 (16 pages).
Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down-scaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches that do not consider the heterogeneity caused by within-die variations can lead to an overly pessimistic result in terms of performance. To realize optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining (HASR) scheme that fully exploits the heterogeneities of homogeneous multi-core processors in embedded domains. We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme, representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of its cores, one of the candidate schedules is bound to the chip to maximize performance. A set of applications is designed to evaluate the proposed scheme. Experimental results show that the proposed scheme improves performance by an average of 22.2% compared with the baseline schedule based on worst-case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves performance by up to 12%.
Keywords: schedule refining; multi-core processor; heterogeneity; representative chip operating point
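The boot-time binding step described in the abstract above can be sketched as a selection over precomputed candidates. This is only an illustration under stated assumptions: the `CandidateSchedule` type, the feasibility rule (every core must reach the frequency its candidate was generated for), and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSchedule:
    name: str
    required_mhz: tuple   # per-core frequency of the representative operating point
    makespan_ms: float    # predicted completion time of the schedule at that point


def bind_schedule(candidates, measured_mhz):
    """Pick the fastest candidate whose operating point the chip can sustain."""
    feasible = [
        c for c in candidates
        if len(c.required_mhz) == len(measured_mhz)
        and all(need <= have for need, have in zip(c.required_mhz, measured_mhz))
    ]
    if not feasible:
        raise ValueError("no candidate schedule fits the measured core frequencies")
    return min(feasible, key=lambda c: c.makespan_ms)


# Hypothetical candidates; the worst-case point is feasible on any shipped chip.
candidates = [
    CandidateSchedule("worst_case", (800, 800, 800, 800), 10.0),
    CandidateSchedule("mid", (900, 850, 900, 850), 8.5),
    CandidateSchedule("fast", (1000, 1000, 950, 950), 7.8),
]
chosen = bind_schedule(candidates, measured_mhz=(950, 900, 920, 870))
print(chosen.name)
```

Because the candidates are generated offline, the work done at boot is a small table lookup rather than a full scheduling run.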
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
10
Authors: Jinying Kong, Kai Nie, Qinglei Zhou, Jinlong Xu, Lin Han. Source: 国际计算机前沿大会会议论文集, 2021, Issue 1, pp. 180-189 (10 pages).
The primary way to achieve thread-level parallelism on the Sunway high-performance multicore processor is to use the OpenMP programming technique. To address the problem of low parallelism efficiency caused by slow access to thread private variables in the compilation of Sunway OpenMP programs, this paper proposes a thread private variable access technique based on privileged instructions. The privileged instruction-based thread-private variable access technique centralizes the implementation of thread-private variables at the compiler level, eliminating the model switching overhead of invoking OS core processing and improving the speed of accessing thread-private variables. On the Sunway 1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved running efficiency gains of 6.2% and 6.8%, respectively. The results show that the techniques proposed in this paper can provide technical support for giving full play to the advantages of Sunway's high-performance multi-core processors.
Keywords: Sunway high-performance multi-core processors; OpenMP programming technique; privileged instruction-based thread-private variable access technique; Sunway 1621 processor
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
11
Authors: Kai Nie, Qinglei Zhou, Hong Qian, Jianmin Pang, Jinlong Xu, Yapeng Li. Source: 国际计算机前沿大会会议论文集, 2021, Issue 1, pp. 163-179 (17 pages).
The leading way to achieve thread-level parallelism on the Sunway high-performance multicore processors is to use OpenMP programming techniques. To address the problem of low parallel efficiency caused by high thread group control overhead in the compilation of Sunway OpenMP programs, this paper proposes the parallel region reconstruction technique. The technique expands the scope of parallel regions in OpenMP programs through parallel region merging and parallel region extending. It also reduces the number of parallel regions in OpenMP programs, decreases the overhead of frequent creation and convergence of thread groups, and converts standard fork-join model OpenMP programs into higher-performance SPMD model OpenMP programs. On the Sunway 1621 server computer, NPB3.3-OMP and SPEC OMP2012 achieved running efficiency improvements of 8.9% and 7.9%, respectively, through the parallel region reconstruction technique. The results indicate that the technique is feasible and effective. It provides technical support for fully exploiting the multi-core parallelism advantage of Sunway's high-performance processors.
Keywords: Sunway high-performance multi-core processors; OpenMP programming technique; parallel domain reconstruction technique
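Implementation of a Radar Terminal Based on the Loongson 3A Processor (Cited by: 2)
12
Authors: 张晓明. Source: 现代雷达 (CSCD, PKU Core), 2013, Issue 10, pp. 81-83 (3 pages).
This paper describes the system architecture, design, and implementation of a radar terminal based on the Loongson 3A processor. The design uses a domestic processor and a high-performance graphics processor, combined with large-scale programmable devices, and uses a high-speed bus to acquire TV/IR video and radar video, classify and transmit them for processing, and display them at high resolution. The design features a domestic processor, high performance, multiple functions, and integration. It offers good information security, is easy to support and maintain, and is suitable for ground-based, vehicle-mounted, and similar display and control terminals.
Keywords: radar terminal; display; domestic processor; field-programmable gate array; TV/infrared; high-speed peripheral component interconnect interface; embedded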
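Research on Evaluation Techniques for Domestic Dual-Interface Financial Card SoC Chips
13
Authors: 菅端端, 任翔, 梁雪连. Source: 信息技术与标准化, 2020, Issue 4, pp. 53-58 (6 pages).
Based on a study of foreign financial card evaluation methods, this paper summarizes the main test items for domestic dual-interface financial card chips, proposes a comprehensive evaluation method for them, and uses the method to run comparative tests of the performance indicators of the main domestic dual-interface financial card chips. Comparison with the international state of the art reveals the strengths and weaknesses of domestic dual-interface financial card SoC chips, and recommendations are made for the future development of such chips.
Keywords: dual interface; Chinese national cryptographic algorithms; domestic processor core; financial card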
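Research on Performance Testing Methods for Loongson Processors (Cited by: 6)
14
Authors: 李士刚, 黄威, 张鹏. Source: 现代电子技术, 2013, Issue 23, pp. 88-90 (3 pages).
By analyzing the architecture and instruction set characteristics of Loongson processors, and building on the compatibility between the Loongson hardware platform and the Linux operating system, this paper proposes a comprehensive method for testing Loongson processor performance. The Loongson 2F processor is tested as a practical example, and the results provide an objective evaluation of Loongson processor performance, which is significant for the domestic production of military computers.
Keywords: Loongson processor; SPEC; performance testing; domestic production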
OpenMDSP: Extending OpenMP to Program Multi-Core DSPs (Cited by: 1)
15
Authors: 何江舟, 陈文光, 陈光日, 郑纬民, 汤志忠, 叶寒栋. Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2014, Issue 2, pp. 316-331 (16 pages).
Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with general-purpose multi-processors, multi-core DSPs normally have a more complex memory hierarchy, such as on-chip core-local memory and non-cache-coherent shared memory. As a result, efficient multi-core DSP applications are very difficult to write. The current approach used to program multi-core DSPs is based on proprietary vendor software development kits (SDKs), which only provide low-level, non-portable primitives. While it is acceptable to write coarse-grained task-level parallel code with these SDKs, writing fine-grained data parallel code with SDKs is very tedious and error-prone. We believe that it is desirable to have a high-level and portable parallel programming model for multi-core DSPs. In this paper, we propose OpenMDSP, an extension of OpenMP designed for multi-core DSPs. The goal of OpenMDSP is to fill the gap between the OpenMP memory model and the memory hierarchy of multi-core DSPs. We propose three classes of directives in OpenMDSP: 1) data placement directives that allow programmers to control the placement of global variables conveniently, 2) distributed array directives that divide a whole array into sections and promote the sections into core-local memory to improve performance, and 3) stream access directives that promote big arrays into core-local memory section by section during parallel loop processing while hiding the latency of data movement by the direct memory access (DMA) of a DSP. We implement the compiler and runtime system for OpenMDSP on the Freescale MSC8156. The benchmarking results show that seven of nine benchmarks achieve a speedup of more than a factor of 5 when using six threads.
Keywords: OpenMP; multi-core digital signal processor; data parallelism; Long Term Evolution
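The section-by-section staging behind the stream access directives in the abstract above can be sketched in plain Python. This is a hypothetical illustration, not OpenMDSP syntax: `stream_map` and its parameters are invented names, the list slice stands in for a DMA transfer into core-local memory, and the overlap of transfer and computation that hides DMA latency is not modeled.

```python
def stream_map(big_array, section_len, kernel):
    """Apply kernel to big_array one section at a time, as if staged through a small local buffer."""
    if section_len <= 0:
        raise ValueError("section_len must be positive")
    out = [None] * len(big_array)
    for start in range(0, len(big_array), section_len):
        # Stand-in for a DMA transfer from shared memory into core-local memory.
        local = big_array[start:start + section_len]
        # The compute loop only touches the small local buffer.
        local = [kernel(v) for v in local]
        # Stand-in for the DMA transfer of results back to shared memory.
        out[start:start + len(local)] = local
    return out


result = stream_map(list(range(10)), section_len=4, kernel=lambda v: v * v)
print(result)
```

On real hardware a second buffer would typically be filled while the first is being processed; that double buffering is left out here to keep the sketch short.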
YHFT-QDSP:High-Performance Heterogeneous Multi-Core DSP
16
Authors: 陈书明, 万江华, 鲁建壮, 刘仲, 孙海燕, 孙永节, 刘衡竹, 刘祥远, 李振涛, 徐毅, 陈小文. Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2010, Issue 2, pp. 214-224 (11 pages).
Multi-core architectures are widely used to enhance microprocessor performance within a limited increase in time-to-market and power consumption of the chips. Toward the application of high-density data signal processing, this paper presents a novel heterogeneous multi-core digital signal processor (DSP), YHFT-QDSP, with one RISC CPU core and four VLIW DSP cores. Through three kinds of interconnection, YHFT-QDSP provides highly efficient message communication between the on-chip RISC core and DSP cores, among on-chip DSP cores, and among inter-chip DSP cores. A parallel programming platform is developed specifically for the heterogeneous multi-core architecture of YHFT-QDSP. This parallel programming environment provides a parallel support library and a friendly interface between high-level application software and the multi-core DSP. The 130 nm CMOS custom chip design results in a high-speed and moderate-power design. The results of typical benchmarks show that the interconnection structure of YHFT-QDSP is much better than other related structures and achieves better speedup when the interconnection facilities are used in combination. YHFT-QDSP has now been signed off and manufactured. Future applications of the multi-core chip could be found in 3G wireless base stations, high-performance radar, industrial applications, and so on.
Keywords: digital signal processor (DSP); multi-core; architecture; parallel programming; custom design
Performance modeling of positive degraded task-pair with helper-thread in CMP
17
Authors: Gu Zhimin, Zheng Ninghan, Zhang Yi, Liu Changding, Tang Jie, Huang Yan. Source: High Technology Letters (EI, CAS), 2010, Issue 3, pp. 221-226 (6 pages).
The helper-thread of a task can hide the memory access time of irregular data on the chip multi-core processor (CMP). To construct a compiler that effectively supports the helper-thread of a task in a multi-core scenario based on the last-level shared cache, this paper studies its performance stable conditions. Because no existing model allows extensive investigation of the impact of stable conditions, we present the base of pre-computation, formalized by our degraded task-pair 〈T, T'〉 with the helper-thread, and analyze its stable conditions. Finally, a novel performance model and a method for constructing pre-computation based on our positive degraded task-pair are proposed. Our experiments show that the approach is efficient. If memory level parallelism (MLP) is further exploited for the task-pair, the task-pair 〈T, T'〉 can reach better performance.
Keywords: chip multi-core processor (CMP); helper-thread; pre-computation; performance model