Funding: Supported by the National Science and Technology Major Project of China (2022ZD0119005) and the Natural Science Project of Shaanxi Province (2025JC-YBMS-754, 2024JC-YBMS-539).
Abstract: Reconfigurable array architectures have become an important hardware platform for edge-side deployment of convolutional neural networks due to their high parallelism and flexible programmability. However, traditional multi-branch convolutional networks suffer from computational redundancy, high memory access overhead, and inefficient branch fusion. This paper therefore proposes an adaptive multi-branch convolutional module (AMBC) that integrates software-hardware co-optimization. During training, learnable fusion coefficients enable adaptive fusion of multi-scale features; during inference, the multiple branches and their normalization parameters are merged with the fusion coefficients into a single 3×3 convolutional kernel through operator fusion. On the SIREA-288 reconfigurable platform, compared with unoptimized multi-branch networks, the proposed AMBC reduces external memory accesses by 47.91% and inference latency by 47.20%, achieving a 1.90× speedup. This approach maximizes the utilization of the reconfigurable logic while minimizing both reconfiguration and data-movement overheads in edge inference.
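The inference-time merge described in this abstract can be illustrated with a minimal single-channel sketch. It is not the authors' implementation: the branch shapes (one 3×3 and one 1×1 convolution), the scalar per-branch normalization parameters, and every function name below are assumptions chosen for illustration. The idea is to fold each branch's normalization into its kernel, zero-pad smaller kernels to 3×3, and sum the branches weighted by their fusion coefficients.

```python
import math

def fold_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold normalization statistics into a conv kernel; returns (kernel, bias)."""
    scale = gamma / math.sqrt(var + eps)
    folded = [[w * scale for w in row] for row in kernel]
    return folded, beta - mean * scale

def pad_to_3x3(kernel):
    """Place a 1x1 kernel at the centre of a zero 3x3 kernel (3x3 passes through)."""
    if len(kernel) == 3:
        return kernel
    out = [[0.0] * 3 for _ in range(3)]
    out[1][1] = kernel[0][0]
    return out

def merge_branches(branches):
    """Merge branches into one 3x3 kernel and bias.

    branches: list of (alpha, kernel, (gamma, beta, mean, var)) tuples,
    where alpha is the (hypothetical) learned fusion coefficient.
    """
    merged = [[0.0] * 3 for _ in range(3)]
    bias = 0.0
    for alpha, kernel, bn in branches:
        k, b = fold_bn(kernel, *bn)
        k = pad_to_3x3(k)
        for i in range(3):
            for j in range(3):
                merged[i][j] += alpha * k[i][j]
        bias += alpha * b
    return merged, bias

def conv2d(x, kernel, bias=0.0):
    """Single-channel 'same' cross-correlation with zero padding."""
    h, w = len(x), len(x[0])
    r = len(kernel) // 2
    out = [[bias] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += kernel[di + r][dj + r] * x[ii][jj]
    return out

def multi_branch(x, branches):
    """Reference path: run every branch (conv, then normalization), then fuse."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for alpha, kernel, (gamma, beta, mean, var) in branches:
        scale = gamma / math.sqrt(var + 1e-5)
        y = conv2d(x, kernel)
        for i in range(h):
            for j in range(w):
                out[i][j] += alpha * ((y[i][j] - mean) * scale + beta)
    return out

if __name__ == "__main__":
    # Made-up input and parameters, used only to compare the two paths.
    x = [[1.0, 2.0, -1.0], [0.0, 3.0, 1.5], [2.0, -1.0, 0.5]]
    branches = [
        (0.7, [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]], (1.2, 0.1, 0.05, 0.9)),
        (0.3, [[0.8]], (0.9, -0.2, 0.1, 1.1)),
    ]
    kernel, bias = merge_branches(branches)
    ref, fused = multi_branch(x, branches), conv2d(x, kernel, bias)
    err = max(abs(a - b) for ra, rb in zip(ref, fused) for a, b in zip(ra, rb))
    assert err < 1e-9
```

Because both the convolution and inference-time normalization are linear, the merged kernel reproduces the multi-branch output up to floating-point rounding while requiring a single pass over the input. That single pass is the property the abstract ties to fewer external memory accesses.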
Abstract: Dynamic optimization relies on runtime profile information to improve the performance of program execution. Traditional profiling techniques incur significant overhead and are not suitable for dynamic optimization. In this paper, a new profiling technique is proposed that combines the strengths of software and hardware to achieve near-zero-overhead profiling. The compiler passes profiling requests to the hardware as a few bits of information in branch instructions, and the processor executes profiling operations asynchronously in available free slots or on dedicated hardware. The compiler instrumentation for this technique is implemented in an Itanium research compiler. The results show that accurate block profiling incurs very little overhead to the user program in terms of program scheduling cycles. For example, the average overhead is 0.6% for the SPECint95 benchmarks. The hardware support required for the new profiling technique is practical. The technique is extended to collect edge profiles for continuous phase transition detection. It is believed that the hardware-software collaborative scheme will enable many profile-driven dynamic optimizations for EPIC processors such as the Itanium processors.
Abstract: We present a simulation framework for wireless sensor networks that supports design exploration and complete microprocessor-instruction-level debugging of network formation, data congestion, and node interaction, all in one simulation environment. A particularly innovative feature is the co-emulation of selected nodes at the clock-cycle-accurate hardware processing level, allowing code debugging and exact execution latency evaluation (considering both protocol stack and application), together with other nodes at the abstract protocol level, meeting a designer’s needs for simulation speed, scalability, and reliability. The simulator is centered on the Zigbee protocol and can be retargeted to different node micro-architectures.
Funding: Supported by the Shanghai Science and Technology Committee (Grant No. 21DZ1204500) and the National Natural Science Foundation of China (Grant No. U1913603).
Abstract: At present, the development and implementation of digital transformation are key to promoting high-quality industry development. Robotic 3D printing, a new digital fabrication method, is being widely studied as a way to tackle the declining productivity of traditional construction methods. Although many studies have been done, most current 3D printing projects face limitations in terms of scale. To bridge this gap, this article proposes a mass customization 3D printing framework system for large-scale projects. It discusses how mass customization is made possible through the joint operation of the FUROBOT software and 3D printing hardware. Taking the east gate of Nanjing Happy Valley Plaza as a case study, the article demonstrates and studies the feasibility of the large-scale mass customization 3D printing framework system.
Funding: Supported by the National Natural Science Foundation of China (NSFC) (Grants No. U19A2061, No. 62272190) and the Sichuan Major R&D Project (Grant No. 22QYCX0168).
Abstract: As CNNs grow deeper, the computation and storage requirements of their weights expand significantly, preventing wide deployment in resource-constrained application scenarios such as embedded systems. To improve the efficiency of the deep CNN inference stage, researchers have explored weight pruning techniques on CNN accelerators (e.g., systolic arrays) to avoid storing and computing unimportant weights. However, these attempts either incur expensive extra hardware costs to encode/decode the irregular sparse weight pattern on accelerators or bring limited performance improvement due to structured pruning’s modest compression ratio. To address this challenge, this paper proposes FASS-Pruner, a Fine-grained Accelerator-aware pruning framework via intra-filter Splitting and inter-filter Shuffling: (1) considering the round-by-round execution behavior of CNN accelerators, FASS-Pruner splits filters into multiple rounds to perform column-wise weight pruning; (2) leveraging the calculation independence across filters on CNN accelerators, FASS-Pruner shuffles the filters to prune the unimportant row-wise weights on the CNN accelerator. Combining the sparse pattern of the pruned CNN and the dataflow of the systolic array, we modify the systolic array-based accelerator to enable it to execute the pruned sparse CNN with better performance and lower energy consumption. By condensing the pruned sparse weights in systolic arrays, FASS-Pruner achieves a comparable pruning ratio while preserving the original dataflow of CNN accelerators, thereby achieving significant performance and energy savings.