To eliminate the energy wasted by traditional static hardware multithreaded processors in real-time embedded systems operating under low workloads, the energy efficiency of hardware multithreading is discussed and a novel dynamic multithreaded architecture is proposed. The proposed architecture saves energy by removing idle threads without modifying the original architecture, and it implements a seamless switching mechanism that protects active threads and avoids pipeline stalls during power-mode switching. A synthesis-tool report for a dynamic multithreaded processor implemented in a 45 nm process indicates that the area of the dynamic multithreaded architecture is only 2.27% larger than that of the static one while achieving dynamic power dissipation, and that it consumes 1.3% more power at the same peak performance.
Pulse echo accumulation is commonly employed in coherent Doppler wind LiDAR (light detection and ranging) under the assumption of steady wind. Here, the measured spectral data are analyzed in the time dimension and the frequency dimension to cope with temporal wind shear and to achieve the optimal accumulation time. A hardware-efficient algorithm combining interpolation and cross-correlation is used to enhance the wind retrieval accuracy by reducing the frequency sampling interval, which in turn reduces the spectral width calculation error. Moreover, the temporal broadening effect and the spatial broadening effect are decoupled according to the strategy we developed.
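The idea of combining interpolation with cross-correlation can be illustrated with a minimal sketch. This is a generic illustration, not the algorithm from the paper: the linear interpolation, the Gaussian template, the interpolation factor, and the function names are all assumptions chosen for clarity.

```python
import math

def gaussian(x, center, width):
    """Unnormalized Gaussian line shape (assumed spectral shape)."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)

def interpolate(spectrum, factor):
    """Linear interpolation: 'factor' fine samples per original frequency bin."""
    fine = []
    for left, right in zip(spectrum, spectrum[1:]):
        for k in range(factor):
            fine.append(left + (right - left) * k / factor)
    fine.append(spectrum[-1])
    return fine

def peak_by_cross_correlation(spectrum, factor, width_bins):
    """Slide a Gaussian template over the interpolated spectrum and return
    the best-matching center, expressed in units of the original bins."""
    fine = interpolate(spectrum, factor)
    best_pos, best_score = 0, float("-inf")
    for pos in range(len(fine)):
        score = sum(v * gaussian(i, pos, width_bins * factor)
                    for i, v in enumerate(fine))
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos / factor

# Hypothetical test spectrum: a peak at bin 10.3, sampled on integer bins.
coarse = [gaussian(i, 10.3, 2.0) for i in range(21)]
raw_peak_bin = max(range(len(coarse)), key=coarse.__getitem__)
refined_peak = peak_by_cross_correlation(coarse, factor=10, width_bins=2.0)
```

In this sketch, picking the largest bin can only resolve the peak to a whole bin, whereas the correlation on the interpolated grid resolves it to a fraction of a bin, which is the sense in which the frequency sampling interval is reduced.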
Neural network pruning is a popular approach to reducing the computational complexity of deep neural networks. In recent years, as growing evidence shows that conventional network pruning methods employ inappropriate proxy metrics, and as new types of hardware become increasingly available, hardware-aware network pruning, which incorporates hardware characteristics in the loop of network pruning, has gained growing attention. Both network accuracy and hardware efficiency (latency, memory consumption, etc.) are critical objectives for the success of network pruning, but the conflict between these objectives makes it impossible to find a single optimal solution. Previous studies mostly convert hardware-aware network pruning into optimization problems with a single objective. In this paper, we propose to solve the hardware-aware network pruning problem with Multi-Objective Evolutionary Algorithms (MOEAs). Specifically, we formulate the problem as a multi-objective optimization problem and propose a novel memetic MOEA, namely HAMP, that combines an efficient portfolio-based selection and a surrogate-assisted local search to solve it. Empirical studies demonstrate the potential of MOEAs in simultaneously providing a set of alternative solutions, and the superiority of HAMP over the state-of-the-art hardware-aware network pruning method.
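Why conflicting objectives yield a set of alternative solutions rather than a single optimum can be seen with a small non-dominated filter. This is a generic Pareto-front sketch, not HAMP itself; the class name, the candidate models, and their error and latency values are hypothetical.

```python
from typing import List, NamedTuple

class PrunedModel(NamedTuple):
    name: str
    error: float       # classification error, lower is better
    latency_ms: float  # measured on the target hardware, lower is better

def dominates(a: PrunedModel, b: PrunedModel) -> bool:
    """a dominates b if it is no worse in both objectives and better in one."""
    no_worse = a.error <= b.error and a.latency_ms <= b.latency_ms
    strictly_better = a.error < b.error or a.latency_ms < b.latency_ms
    return no_worse and strictly_better

def pareto_front(models: List[PrunedModel]) -> List[PrunedModel]:
    """Keep every model that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]

# Hypothetical candidates produced by different pruning configurations.
candidates = [
    PrunedModel("A", 0.08, 30.0),
    PrunedModel("B", 0.10, 18.0),
    PrunedModel("C", 0.12, 22.0),  # worse than B in both objectives
    PrunedModel("D", 0.15, 9.0),
]
front = pareto_front(candidates)
```

Models A, B, and D each trade accuracy against latency differently, so none of them can be called the single best; a multi-objective method returns all of them and leaves the choice to the deployment constraints.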
The design of a high-speed decoder using a traditional partly parallel architecture for Non-Quasi-Cyclic (NQC) Low-Density Parity-Check (LDPC) codes is a challenging problem due to its high memory-block cost and low hardware utilization efficiency. In this paper, we present efficient hardware implementation schemes for NQC-LDPC codes. First, we propose an implementation-oriented construction scheme for NQC-LDPC codes to avoid memory-access conflicts in the partly parallel decoder. Then, we propose a Modified Overlapped Message-Passing (MOMP) algorithm for the hardware implementation of NQC-LDPC codes. This algorithm doubles the hardware utilization efficiency and supports a higher degree of parallelism than the Overlapped Message-Passing (OMP) technique proposed in previous works. We also present single-core and multi-core decoder architectures for the proposed MOMP algorithm to reduce memory cost and improve circuit efficiency. Moreover, we introduce a technique called the cycle bus to further reduce the number of block RAMs in multi-core decoders. Using numerical examples, we show that, for a rate-2/3, length-15360 NQC-LDPC code with 8.43-dB coding gain for Binary Phase-Shift Keying (BPSK) in an Additive White Gaussian Noise (AWGN) channel, the decoder with the proposed scheme achieves a 23.8%–52.6% reduction in logic utilization per Mbps and a 29.0%–90.0% reduction in message-memory bits per Mbps.
In this paper, we propose a new lightweight block cipher named RECTANGLE. The main idea of the design of RECTANGLE is to allow lightweight and fast implementations using bit-slice techniques. RECTANGLE uses an SP-network. The substitution layer consists of 16 4 × 4 S-boxes in parallel. The permutation layer is composed of 3 rotations. As shown in this paper, RECTANGLE offers great performance in both hardware and software environments, which provides enough flexibility for different application scenarios. The following are 3 main advantages of RECTANGLE. First, RECTANGLE is extremely hardware-friendly. For the 80-bit key version, a one-cycle-per-round parallel implementation only needs 1600 gates for a throughput of 246 kbit/s at a 100 kHz clock and an energy efficiency of 3.0 pJ/bit. Second, RECTANGLE achieves a very competitive software speed among the existing lightweight block ciphers due to its bit-slice style. Using 128-bit SSE instructions, a bit-slice implementation of RECTANGLE reaches an average encryption speed of about 3.9 cycles/byte for messages around 3000 bytes. Last but not least, we propose new design criteria for the RECTANGLE S-box. Due to our careful selection of the S-box and the asymmetric design of the permutation layer, RECTANGLE achieves a very good security-performance tradeoff. Our extensive and deep security analysis shows that the highest number of rounds that we can attack is 18 (out of 25).
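The hardware figures quoted for RECTANGLE can be cross-checked with a few lines of arithmetic. The sketch below assumes a 64-bit block, inferred from 16 S-boxes of 4 bits each covering the state once; the reading of the result as "25 rounds plus one extra cycle" and the implied power figure are inferences, not statements from the abstract.

```python
# Figures taken from the abstract.
SBOX_COUNT = 16
SBOX_WIDTH_BITS = 4
CLOCK_HZ = 100e3            # 100 kHz
THROUGHPUT_BPS = 246e3      # 246 kbit/s
ENERGY_PER_BIT_J = 3.0e-12  # 3.0 pJ/bit

# Assumption: the S-box layer covers the whole state exactly once.
block_bits = SBOX_COUNT * SBOX_WIDTH_BITS

# Clock cycles spent per encrypted block at the quoted throughput.
cycles_per_block = block_bits * CLOCK_HZ / THROUGHPUT_BPS

# Power implied by the quoted energy per bit and throughput.
implied_power_w = ENERGY_PER_BIT_J * THROUGHPUT_BPS

print(f"block size: {block_bits} bits")
print(f"cycles per block: {cycles_per_block:.2f}")
print(f"implied power: {implied_power_w * 1e6:.3f} uW")
```

About 26 cycles per block is consistent with a one-cycle-per-round implementation of a 25-round cipher that needs one additional cycle per block, and the implied power at 100 kHz is below one microwatt.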
An area-efficient design methodology is proposed for analog decoding implementations of the rate-1/2 accumulate repeat-4 jagged-accumulate (AR4JA) low-density parity-check (LDPC) code. The proposed approach uses an optimized decoding architecture and a regularized routing network, so that the overall wiring overhead is minimized and the silicon area utilization is significantly improved. The prototype chip used to verify the approach is fully integrated in a four-metal double-poly 0.35 μm complementary metal oxide semiconductor (CMOS) technology and includes an input-output interface that maximizes the decoder throughput. The decoding core area is 2.02 mm² with a post-layout area utilization of 80%. The decoder was successfully tested at the maximum data rate of 10 Mbit/s, with a core power consumption of 6.78 mW at 3.3 V, which corresponds to an energy per decoded bit of 0.677 nJ. The proposed analog LDPC decoder, with low processing power and high reliability, is suitable for space- and power-constrained spacecraft systems.
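The energy-per-bit figure follows directly from the quoted power and data rate. A minimal check, assuming the core draws the stated power continuously while decoding at the maximum rate (the function name is illustrative):

```python
def energy_per_bit_nj(power_w: float, data_rate_bps: float) -> float:
    """Energy per decoded bit in nanojoules: power divided by bit rate."""
    return power_w / data_rate_bps * 1e9

# Figures taken from the abstract: 6.78 mW core power at 10 Mbit/s.
energy_nj = energy_per_bit_nj(6.78e-3, 10e6)
print(f"{energy_nj:.3f} nJ per decoded bit")
```

The rounded inputs give about 0.678 nJ per bit, which agrees with the quoted 0.677 nJ to within the rounding of the power figure.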
Funding (dynamic multithreaded architecture): supported partially by the National High Technical Research and Development Program of China (863 Program) under Grants No. 2011AA040101 and No. 2008AA01Z134; the National Natural Science Foundation of China under Grants No. 61003251, No. 61172049, and No. 61173150; the Doctoral Fund of Ministry of Education of China under Grant No. 20100006110015; the Beijing Municipal Natural Science Foundation under Grant No. Z111100054011078; and the 2012 Ladder Plan Project of Beijing Key Laboratory of Knowledge Engineering for Materials Science under Grant No. Z121101002812005.
Funding (coherent Doppler wind LiDAR): project supported by the Shanghai Science and Technology Innovation Action (Grant No. 22dz1208700).
Funding (hardware-aware network pruning, HAMP): the National Natural Science Foundation of China (62106098); the Stable Support Plan Program of Shenzhen Natural Science Fund (20200925154942002); and the MOE University Scientific-Technological Innovation Plan Program.
Funding (NQC-LDPC decoder): supported in part by the National Natural Science Foundation of China (Nos. 61101072 and 61132002); the new strategic industries development projects of Shenzhen city (No. ZDSY20120616141333842); and the Tsinghua University Initiative Scientific Research Program (No. 2012Z10132).
Funding (RECTANGLE block cipher): supported by the National Natural Science Foundation of China (Grant No. 61379138); the Research Fund KU Leuven (OT/13/071); the "Strategic Priority Research Program" of the Chinese Academy of Sciences (Grant No. XDA06010701); and the National High-tech R&D Program of China (863 Program) (Grant No. 2013AA014002).