Present a kind of method which is used to communicate between serial serial port and peripheral equipment dynamicly and real-time using multithreading technique based on the basic principle of communication and multit...Present a kind of method which is used to communicate between serial serial port and peripheral equipment dynamicly and real-time using multithreading technique based on the basic principle of communication and multitasking mechanism in the circumstance of Windows. This method resolves the question of Real-time answering in the serial communication validly, reduces losing rate of data and improves reliability of system. This article presents a general method used in the serial communication which is practical.展开更多
Convolutional neural network (CNN) is an essential model to achieve high accuracy in various machine learning applications, such as image recognition and natural language processing. One of the important issues for CN...Convolutional neural network (CNN) is an essential model to achieve high accuracy in various machine learning applications, such as image recognition and natural language processing. One of the important issues for CNN acceleration with high energy efficiency and processing performance is efficient data reuse by exploiting the inherent data locality. In this paper, we propose a novel CGRA (Coarse Grained Reconfigurable Array) architecture with time-domain multithreading for exploiting input data locality. The multithreading on each processing element enables the input data reusing through multiple computation periods. This paper presents the accelerator design performance analysis of the proposed architecture. We examine the structure of memory subsystems, as well as the architecture of the computing array, to supply required data with minimal performance overhead. We explore efficient architecture design alternatives based on the characteristics of modern CNN configurations. The evaluation results show that the available bandwidth of the external memory can be utilized efficiently when the output plane is wider (in earlier layers of many CNNs) while the input data locality can be utilized maximally when the number of output channel is larger (in later layers).展开更多
To overcome the ever-increasing susceptibility to transient-fault in processors, various redundant multithreading (RMT) architectures have been proposed, which is becoming a most effective approach for detecting and...To overcome the ever-increasing susceptibility to transient-fault in processors, various redundant multithreading (RMT) architectures have been proposed, which is becoming a most effective approach for detecting and recovering from transient-fault. This paper surveys a wide range of RMT architectures-from the original AR-SMT(A-stream R-stream Simultaneous MultiThreading) to the most-recent SD-SRT (Slack-Decode Simultaneous Redundant Threading), presenting traverse analyses and comparisons among them, and hereby demonstrates its evolution and tendency. Finally, some directions and suggestions are put forward for the further RMT research and development.展开更多
Transient fault detection mechanism is added to simultaneous multithreading architecture. By exploiting both ILP (Instruction Level Parallelism) and TLP (Thread Level Parallelism), Simultaneous Multithreading (SMT) Fa...Transient fault detection mechanism is added to simultaneous multithreading architecture. By exploiting both ILP (Instruction Level Parallelism) and TLP (Thread Level Parallelism), Simultaneous Multithreading (SMT) Fault Tolerance Processor can be expected to achieve better tradeoff between performance and hardware cost than traditional Fault Tolerance Processors. Detailed simulations of 3 of SPEC95 benchmarks show that executing two redundant programs on the fault-tolerant microarchitecture takes only 40%–61%longer than running a single version of the program. The new instruction fetch algorithm enhances the performance by 0.4%~1%to most of the benchmarks we choose randomly.展开更多
In order to eliminate the energy waste caused by the traditional static hardware multithreaded processor used in real-time embedded system working in the low workload situation, the energy efficiency of the hardware m...In order to eliminate the energy waste caused by the traditional static hardware multithreaded processor used in real-time embedded system working in the low workload situation, the energy efficiency of the hardware multithread is discussed and a novel dynamic multithreaded architecture is proposed. The proposed architecture saves the energy wasted by removing idle threads without manipulation on the original architecture, fulfills a seamless switching mechanism which protects active threads and avoids pipeline stall during power mode switching. The report of an implemented dynamic multithreaded processor with 45 nm process from synthesis tool indicates that the area of dynamic multithreaded architecture is only 2.27% higher than the static one in achieving dynamic power dissipation, and consumes 1.3% more power in the same peak performance.展开更多
最优线程数设置是影响多线程程序性能和功耗的关键之一。然而,目前寻找最优线程数的算法通常是从单一固定起点开始搜索,往往会造成搜索精度低、搜索开销大的问题。最优线程数的分布和位置与多种因素有关,包括程序所属类型、优化目标(性...最优线程数设置是影响多线程程序性能和功耗的关键之一。然而,目前寻找最优线程数的算法通常是从单一固定起点开始搜索,往往会造成搜索精度低、搜索开销大的问题。最优线程数的分布和位置与多种因素有关,包括程序所属类型、优化目标(性能、功耗和EDP(Energy-delay Product))、并行的多线程区域、软硬件配置参数等。围绕能效优先的最优线程数搜索问题,提出了能效优先的特定起点分类最优线程数搜索算法(Energy-Efficiency-First Optimal Thread Number Search Algorithm based on Specific Starting Point Classification,简称TS^(3)方法)”,通过设计基于程序分类的特殊起点设定方法来确定搜索起点,并采用启发式算法和二分查找方法搜索最优线程数,提升搜索效率,有效提升了能效优先目标(性能最优、功耗最优、能效EDP最优)下的最优线程数搜索精度并降低了搜索开销。在两个x86和一个ARM平台上用8个benchmark对算法有效性进行了详细实验验证,结果表明,与Baseline相比,TS^(3)方法的性能平均提升0.29%(平台A)、0.17%(平台B)、10.77%(平台C);功耗平均降低2.35%(平台A)、1.87%(平台B)、15.97%(平台C);EDP平均降低6.36%(平台A)、5.07%(平台B)、46.94%(平台C)。在3个平台上,与目前经典搜索方法相比,TS^(3)方法的性能平均提升10.16%,功耗平均降低13.45%,EDP平均降低23.77%;搜索开销平均降低86.8%。展开更多
随着超大规模集成电路(Very Large Scale Integration Circuit,VLSI)制造工艺的快速发展以及其对应集成度的不断提高,数字集成电路的设计迎来了许多挑战。时钟树综合是数字后端设计的重要部分,现有的时钟树综合算法开始面临迭代效率变...随着超大规模集成电路(Very Large Scale Integration Circuit,VLSI)制造工艺的快速发展以及其对应集成度的不断提高,数字集成电路的设计迎来了许多挑战。时钟树综合是数字后端设计的重要部分,现有的时钟树综合算法开始面临迭代效率变低和收敛速度变慢的问题。因此,提出了一种同步并发时钟树分级聚类算法(Synchronous Clock-tree Hierarchical Partitioning and Clustering,SC-HPC)。从系统优化的角度出发,SC-HPC将原始的寄存器聚类过程转化为粗聚类和细聚类两步。粗聚类将布局完成的寄存器分为N大簇群,进一步把N个簇的细化任务分配给用户可调度的线程中进行加速处理。细聚类是根据缓冲器最大扇出的规则进行更加细致地划分寄存器。实验结果表明,相较于现有方法,SC-HPC算法降低了缓冲器数量(30%以上)和程序运行时长(20%以上)。展开更多
文摘Present a kind of method which is used to communicate between serial serial port and peripheral equipment dynamicly and real-time using multithreading technique based on the basic principle of communication and multitasking mechanism in the circumstance of Windows. This method resolves the question of Real-time answering in the serial communication validly, reduces losing rate of data and improves reliability of system. This article presents a general method used in the serial communication which is practical.
文摘Convolutional neural network (CNN) is an essential model to achieve high accuracy in various machine learning applications, such as image recognition and natural language processing. One of the important issues for CNN acceleration with high energy efficiency and processing performance is efficient data reuse by exploiting the inherent data locality. In this paper, we propose a novel CGRA (Coarse Grained Reconfigurable Array) architecture with time-domain multithreading for exploiting input data locality. The multithreading on each processing element enables the input data reusing through multiple computation periods. This paper presents the accelerator design performance analysis of the proposed architecture. We examine the structure of memory subsystems, as well as the architecture of the computing array, to supply required data with minimal performance overhead. We explore efficient architecture design alternatives based on the characteristics of modern CNN configurations. The evaluation results show that the available bandwidth of the external memory can be utilized efficiently when the output plane is wider (in earlier layers of many CNNs) while the input data locality can be utilized maximally when the number of output channel is larger (in later layers).
基金Supported by the National Natural Science Foun-dation of China (60503015)
文摘To overcome the ever-increasing susceptibility to transient-fault in processors, various redundant multithreading (RMT) architectures have been proposed, which is becoming a most effective approach for detecting and recovering from transient-fault. This paper surveys a wide range of RMT architectures-from the original AR-SMT(A-stream R-stream Simultaneous MultiThreading) to the most-recent SD-SRT (Slack-Decode Simultaneous Redundant Threading), presenting traverse analyses and comparisons among them, and hereby demonstrates its evolution and tendency. Finally, some directions and suggestions are put forward for the further RMT research and development.
基金Supported by the National Natural Science Funda tion of China (60103002)
文摘Transient fault detection mechanism is added to simultaneous multithreading architecture. By exploiting both ILP (Instruction Level Parallelism) and TLP (Thread Level Parallelism), Simultaneous Multithreading (SMT) Fault Tolerance Processor can be expected to achieve better tradeoff between performance and hardware cost than traditional Fault Tolerance Processors. Detailed simulations of 3 of SPEC95 benchmarks show that executing two redundant programs on the fault-tolerant microarchitecture takes only 40%–61%longer than running a single version of the program. The new instruction fetch algorithm enhances the performance by 0.4%~1%to most of the benchmarks we choose randomly.
基金supported partially by the National High Technical Research and Development Program of China (863 Program) under Grants No. 2011AA040101, No. 2008AA01Z134the National Natural Science Foundation of China under Grants No. 61003251, No. 61172049, No. 61173150+2 种基金the Doctoral Fund of Ministry of Education of China under Grant No. 20100006110015Beijing Municipal Natural Science Foundation under Grant No. Z111100054011078the 2012 Ladder Plan Project of Beijing Key Laboratory of Knowledge Engineering for Materials Science under Grant No. Z121101002812005
文摘In order to eliminate the energy waste caused by the traditional static hardware multithreaded processor used in real-time embedded system working in the low workload situation, the energy efficiency of the hardware multithread is discussed and a novel dynamic multithreaded architecture is proposed. The proposed architecture saves the energy wasted by removing idle threads without manipulation on the original architecture, fulfills a seamless switching mechanism which protects active threads and avoids pipeline stall during power mode switching. The report of an implemented dynamic multithreaded processor with 45 nm process from synthesis tool indicates that the area of dynamic multithreaded architecture is only 2.27% higher than the static one in achieving dynamic power dissipation, and consumes 1.3% more power in the same peak performance.
文摘最优线程数设置是影响多线程程序性能和功耗的关键之一。然而,目前寻找最优线程数的算法通常是从单一固定起点开始搜索,往往会造成搜索精度低、搜索开销大的问题。最优线程数的分布和位置与多种因素有关,包括程序所属类型、优化目标(性能、功耗和EDP(Energy-delay Product))、并行的多线程区域、软硬件配置参数等。围绕能效优先的最优线程数搜索问题,提出了能效优先的特定起点分类最优线程数搜索算法(Energy-Efficiency-First Optimal Thread Number Search Algorithm based on Specific Starting Point Classification,简称TS^(3)方法)”,通过设计基于程序分类的特殊起点设定方法来确定搜索起点,并采用启发式算法和二分查找方法搜索最优线程数,提升搜索效率,有效提升了能效优先目标(性能最优、功耗最优、能效EDP最优)下的最优线程数搜索精度并降低了搜索开销。在两个x86和一个ARM平台上用8个benchmark对算法有效性进行了详细实验验证,结果表明,与Baseline相比,TS^(3)方法的性能平均提升0.29%(平台A)、0.17%(平台B)、10.77%(平台C);功耗平均降低2.35%(平台A)、1.87%(平台B)、15.97%(平台C);EDP平均降低6.36%(平台A)、5.07%(平台B)、46.94%(平台C)。在3个平台上,与目前经典搜索方法相比,TS^(3)方法的性能平均提升10.16%,功耗平均降低13.45%,EDP平均降低23.77%;搜索开销平均降低86.8%。
文摘随着超大规模集成电路(Very Large Scale Integration Circuit,VLSI)制造工艺的快速发展以及其对应集成度的不断提高,数字集成电路的设计迎来了许多挑战。时钟树综合是数字后端设计的重要部分,现有的时钟树综合算法开始面临迭代效率变低和收敛速度变慢的问题。因此,提出了一种同步并发时钟树分级聚类算法(Synchronous Clock-tree Hierarchical Partitioning and Clustering,SC-HPC)。从系统优化的角度出发,SC-HPC将原始的寄存器聚类过程转化为粗聚类和细聚类两步。粗聚类将布局完成的寄存器分为N大簇群,进一步把N个簇的细化任务分配给用户可调度的线程中进行加速处理。细聚类是根据缓冲器最大扇出的规则进行更加细致地划分寄存器。实验结果表明,相较于现有方法,SC-HPC算法降低了缓冲器数量(30%以上)和程序运行时长(20%以上)。