Journal Articles
7 articles found
1. MEMORY-EFFICIENT PATH METRIC UPDATE METHOD IN MAP DECODER IMPLEMENTATION
Authors: He Chun, Hu Jianhao. Journal of Electronics (China), 2008, Issue 2, pp. 145-149 (5 pages).
A novel memory-efficient path metric update is proposed for the Maximum A Posteriori (MAP) decoder of turbo codes to reduce the memory requirement of state metric information calculation. With this metric update scheme, the same memory can be shared by the forward and backward metrics of the MAP decoder, and the forward and backward metric updates can be performed at the same time. All of the extrinsic information can be calculated at the end of the metric update. Therefore, the latency and area of the implementation will be reduced with the proposed metric update method.
Keywords: Turbo decode, Metric update method, memory efficient
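For context (this is not taken from the paper, which is only abstracted above): a common way to realize this kind of sharing is to run the forward and backward recursions simultaneously from the two ends of the block into a single metric buffer, so that when the sweeps cross the midpoint the opposite-direction metric is already stored wherever it is needed. The Python sketch below is a heavily simplified schematic of that idea; `forward_step`, `backward_step`, and `extrinsic` are hypothetical placeholders for the actual BCJR recursions, and the buffer layout is an assumption, not the authors' published scheme.

```python
import numpy as np

# Hypothetical per-step recursions and extrinsic combiner: a real BCJR/MAP
# decoder would derive these from the trellis branch metrics (gammas).
def forward_step(alpha, k):
    return alpha          # placeholder

def backward_step(beta, k):
    return beta           # placeholder

def extrinsic(alpha, beta, k):
    return 0.0            # placeholder

def bidirectional_metric_update(N, num_states):
    """Schematic only: one buffer is shared by the forward and backward
    recursions, which run simultaneously from opposite ends of the block.
    Once the two sweeps cross the midpoint, the opposite-direction metric is
    already stored at every remaining position, so the extrinsic values are
    available by the time the metric update finishes (midpoint boundary
    handling omitted for brevity)."""
    shared = np.zeros((N + 1, num_states))        # single shared metric buffer
    alpha, beta = shared[0].copy(), shared[N].copy()
    mid = N // 2
    for k in range(1, mid + 1):                   # phase 1: fill from both ends
        alpha = forward_step(alpha, k)
        shared[k] = alpha
        beta = backward_step(beta, N - k)
        shared[N - k] = beta
    ext = np.zeros(N + 1)
    for k in range(mid + 1, N + 1):               # phase 2: combine on the fly
        alpha = forward_step(alpha, k)
        ext[k] = extrinsic(alpha, shared[k], k)             # beta already stored here
        beta = backward_step(beta, N - k)
        ext[N - k] = extrinsic(shared[N - k], beta, N - k)  # alpha already stored here
    return ext
```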
2. Memory Efficient Two-Pass 3D FFT Algorithm for Intel(R) Xeon Phi(TM) Coprocessor (cited by 2)
Authors: Liu Yiqun, Li Yan, Zhang Yunquan, Zhang Xianyi. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2014, Issue 6, pp. 989-1002 (14 pages).
Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel(R) Many Integrated Core (MIC) architecture provides not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely studied algorithm; however, the conventional algorithm needs to traverse the data three times. In each pass, it computes multiple 1D FFTs along one of the three dimensions, giving rise to plenty of strided memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the remaining dimensions respectively. The difference in the amount of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse in local caches. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance the performance. We evaluate the algorithm on the Intel(R) Xeon Phi(TM) coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which beats the vendor-specific Intel(R) MKL library by a factor of up to 2.22X.
Keywords: 3D-FFT, memory efficient, many-core, Many Integrated Core, Intel(R) Xeon Phi(TM)
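The two-pass organization rests on the classical Cooley-Tukey identity that lets one FFT dimension be split into two sub-dimensions. The NumPy sketch below illustrates only that identity in 1D for a length N = N1·N2 transform; it is not the paper's optimized many-core implementation (no tiling, padding, or offload), and the function name `split_fft` is ours.

```python
import numpy as np

def split_fft(x, N1, N2):
    """Compute a length N = N1*N2 DFT by splitting the dimension in two
    (decimation-in-time Cooley-Tukey). Matches np.fft.fft(x) up to
    floating-point error."""
    N = N1 * N2
    x2 = x.reshape(N1, N2)                       # x[n] with n = n1*N2 + n2
    A = np.fft.fft(x2, axis=0)                   # N2 transforms of length N1
    k1 = np.arange(N1).reshape(N1, 1)
    n2 = np.arange(N2).reshape(1, N2)
    twiddle = np.exp(-2j * np.pi * k1 * n2 / N)  # couples the two sub-dimensions
    B = np.fft.fft(A * twiddle, axis=1)          # N1 transforms of length N2
    return B.T.reshape(N)                        # X[k] with k = k1 + N1*k2

# quick check against the library FFT
x = np.random.rand(12) + 1j * np.random.rand(12)
assert np.allclose(split_fft(x, 3, 4), np.fft.fft(x))
```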
3. Memory-efficient tensor parallelism for long-sequence Transformer training
Authors: Peng LIANG, Linbo QIAO, Yanqi SHI, Hao ZHENG, Yu TANG, Dongsheng LI. Frontiers of Information Technology & Electronic Engineering, 2025, Issue 5, pp. 770-787 (18 pages).
Transformer-based models like large language models (LLMs) have attracted significant attention in recent years due to their superior performance. A long sequence of input tokens is essential for industrial LLMs to provide better user services. However, memory consumption increases quadratically with the increase of sequence length, posing challenges for scaling up long-sequence training. Current parallelism methods produce duplicated tensors during execution, leaving space for improving memory efficiency. Additionally, tensor parallelism (TP) cannot achieve effective overlap between computation and communication. To solve these weaknesses, we propose a general parallelism method called memory-efficient tensor parallelism (METP), designed for the computation of two consecutive matrix multiplications and a possible function between them (O = f(AB)C), which is the kernel computation component in Transformer training. METP distributes subtasks of computing O to multiple devices and uses send/recv instead of collective communication to exchange submatrices for finishing the computation, avoiding producing duplicated tensors. We also apply the double buffering technique to achieve better overlap between computation and communication. We present the theoretical condition of full overlap to help instruct the long-sequence training of Transformers. Suppose the parallel degree is p; through theoretical analysis, we prove that METP provides O(1/p^3) memory overhead when not using FlashAttention to compute attention and could save at least 41.7% memory compared to TP when using FlashAttention to compute multi-head self-attention. Our experimental results demonstrate that METP can increase the sequence length by 2.38–2.99 times compared to other methods when using eight A100 graphics processing units (GPUs).
Keywords: Distributed learning, Large language model (LLM), Long sequence, Machine learning system, memory efficiency, Tensor parallelism
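As a rough illustration of the kernel METP targets, O = f(AB)C can be assembled from per-device partial products when f is an elementwise function (for example, the activation between the two projections of a Transformer MLP), because the column blocks of AB then stay independent. The single-process NumPy sketch below shows only that algebraic split; it is not the METP algorithm itself — the send/recv exchange, double buffering, and the attention/FlashAttention case are omitted, and `split_two_matmuls` is a made-up name.

```python
import numpy as np

def gelu(x):
    # elementwise activation, so column blocks of A @ B stay independent
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def split_two_matmuls(A, B, C, p):
    """Single-process sketch of computing O = f(A @ B) @ C as a sum of p
    partial products (one per simulated device), so the full intermediate
    activation f(A @ B) is never materialized in one place."""
    blocks = np.array_split(np.arange(B.shape[1]), p)
    O = np.zeros((A.shape[0], C.shape[1]))
    for cols in blocks:                      # each iteration = one device's subtask
        O += gelu(A @ B[:, cols]) @ C[cols, :]
    return O

# agrees with the unsplit computation
A, B, C = np.random.rand(4, 8), np.random.rand(8, 16), np.random.rand(16, 6)
assert np.allclose(split_two_matmuls(A, B, C, p=4), gelu(A @ B) @ C)
```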
4. RevFB-BEV: Memory-Efficient Network With Reversible Swin Transformer for 3D BEV Object Detection
Authors: Leilei Pan, Yingnan Guo, Yu Zhang. IET Cyber-Systems and Robotics, 2025, Issue 3, pp. 49-61 (13 pages).
The perception of Bird's Eye View (BEV) has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency. However, the increasing complexity of neural network architectures has resulted in higher training memory, thereby limiting the scalability of model training. To address these challenges, we propose a novel model, RevFB-BEV, which is based on the Reversible Swin Transformer (RevSwin) with Forward-Backward View Transformation (FBVT) and LiDAR Guided Back Projection (LGBP). This approach includes the RevSwin backbone network, which employs a reversible architecture to minimise training memory by recomputing intermediate parameters. Moreover, we introduce the FBVT module that refines BEV features extracted from forward projection, yielding denser and more precise camera BEV representations. The LGBP module further utilises LiDAR BEV guidance for back projection to achieve more accurate camera BEV features. Extensive experiments on the nuScenes dataset demonstrate notable performance improvements, with our model achieving over a 4x reduction in training memory and a more than 12x decrease in single-backbone training memory. These efficiency gains become even more pronounced with deeper network architectures. Additionally, RevFB-BEV achieves 68.1 mAP (mean Average Precision) on the validation set and 68.9 mAP on the test set, which is nearly on par with the baseline BEVFusion, underscoring its effectiveness in resource-constrained scenarios.
Keywords: 3D object detection, Bird's Eye View (BEV), memory efficiency, reversible architecture, view transformation
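For background on the reversible architecture mentioned above: reversible blocks avoid storing activations because each block's input can be recomputed exactly from its output during the backward pass. The sketch below shows the standard two-stream additive coupling used by reversible ResNets/Transformers in general; the functions F and G are hypothetical stand-ins, not the actual RevSwin attention and MLP sub-layers.

```python
import numpy as np

# Hypothetical sub-layer functions; in a reversible Swin block these would be
# the (windowed) attention and MLP sub-layers.
def F(x):
    return np.tanh(x)

def G(x):
    return 0.5 * x

def reversible_forward(x1, x2):
    """Additive coupling: the outputs (y1, y2) fully determine the inputs,
    so intermediate activations need not be stored for backprop."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    """Exact reconstruction of the inputs from the outputs."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.rand(3), np.random.rand(3)
assert np.allclose(reversible_inverse(*reversible_forward(x1, x2)), (x1, x2))
```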
5. Fully invertible hyperbolic neural networks for segmenting large-scale surface and sub-surface data
Authors: Bas Peters, Eldad Haber, Keegan Lensink. Artificial Intelligence in Geosciences, 2024, Issue 1, pp. 269-281 (13 pages).
The large spatial/temporal/frequency scale of geoscience and remote-sensing datasets causes memory issues when using convolutional neural networks for (sub-)surface data segmentation. Recently developed fully reversible or fully invertible networks can mostly avoid memory limitations by recomputing the states during the backward pass through the network. This results in a low and fixed memory requirement for storing network states, as opposed to the typical linear memory growth with network depth. This work focuses on a fully invertible network based on the telegraph equation. While reversibility saves the major amount of memory used in deep networks by the data, the convolutional kernels can take up most memory if fully invertible networks contain multiple invertible pooling/coarsening layers. We address the explosion of the number of convolutional kernels by combining fully invertible networks with layers that contain the convolutional kernels in a compressed form directly. A second challenge is that invertible networks output a tensor the same size as their input. This property prevents the straightforward application of invertible networks to applications that map between different input-output dimensions, need to map to outputs with more channels than present in the input data, or desire outputs that decrease/increase the resolution compared to the input data. However, we show that by employing invertible networks in a non-standard fashion, we can still use them for these tasks. Examples in hyperspectral land-use classification, airborne geophysical surveying, and seismic imaging illustrate that we can input large data volumes in one chunk and do not need to work on small patches, use dimensionality reduction, or employ methods that classify a patch to a single central pixel.
Keywords: Invertible neural networks, Large scale deep learning, Memory efficient deep learning
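The telegraph-equation formulation obtains invertibility through a two-step, leapfrog-style recurrence: the next state is built from the two previous ones, so earlier states can be reconstructed exactly instead of stored. The toy sketch below illustrates only that recurrence, with a hypothetical scalar `layer` standing in for the network's convolutions; the compressed-kernel layers and resolution/channel-changing tricks from the paper are not shown.

```python
import numpy as np

def layer(x, w):
    # hypothetical stand-in for a convolutional layer plus nonlinearity
    return np.tanh(w * x)

def forward(x_prev, x_curr, weights, h=0.1):
    """Leapfrog / telegraph-style update: x_{t+1} = 2 x_t - x_{t-1} + h^2 f(x_t).
    Only the two most recent states ever need to be kept."""
    a, b = x_prev, x_curr
    for w in weights:
        a, b = b, 2 * b - a + h**2 * layer(b, w)
    return a, b

def reconstruct_initial(x_last_prev, x_last, weights, h=0.1):
    """Run the same recurrence backwards: x_{t-1} = 2 x_t + h^2 f(x_t) - x_{t+1}."""
    curr, nxt = x_last_prev, x_last
    for w in reversed(weights):
        prev = 2 * curr - nxt + h**2 * layer(curr, w)
        nxt, curr = curr, prev
    return curr, nxt

x0, x1 = np.random.rand(4), np.random.rand(4)
w = np.random.rand(3)
assert np.allclose(reconstruct_initial(*forward(x0, x1, w), w), (x0, x1))
```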
6. Implementation of a Real-time JPEG2000 System Using DSPs for 2 Digital Cameras (cited by 1)
Authors: HA DAC BINH. Information and Electronic Engineering, 2006, Issue 3, pp. 215-220 (6 pages).
This paper presents techniques and approaches capable of achieving a real-time JPEG2000 compression system using DSP chips. We propose a three-DSP real-time parallel processing system using efficient memory management for the discrete wavelet transform (DWT) and a parallel-pass architecture for embedded block coding with optimized truncation (EBCOT). This system performs compression of 1392×1040-pixel monochrome images at a speed of 10 fps per camera for 2 digital still cameras, and is proven to be a practical and efficient DSP solution.
Keywords: JPEG2000, DSP system, efficient memory management, lifting DWT
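The "lifting DWT" keyword refers to the lifting form of the reversible 5/3 wavelet filter in JPEG2000, which allows in-place, integer-only computation — the property that memory-lean DWT implementations exploit. The sketch below is a generic 1D forward lifting step (with simplified symmetric mirroring at the borders), not the paper's DSP-specific buffering scheme.

```python
def lifting_53_forward(samples):
    """One level of the JPEG2000 reversible 5/3 lifting transform in 1D.
    Odd positions end up holding high-pass details, even positions the
    low-pass approximation; the update is done in place on a copy."""
    x = list(samples)
    n = len(x)
    # predict step: subtract the average of the even neighbours from odd samples
    for i in range(1, n, 2):
        left = x[i - 1]
        right = x[i + 1] if i + 1 < n else x[i - 1]   # symmetric mirror at the edge
        x[i] -= (left + right) // 2
    # update step: add a rounded correction from the detail neighbours
    for i in range(0, n, 2):
        left = x[i - 1] if i - 1 >= 0 else x[i + 1]   # symmetric mirror at the edge
        right = x[i + 1] if i + 1 < n else x[i - 1]
        x[i] += (left + right + 2) // 4
    return x

print(lifting_53_forward([10, 12, 14, 13, 11, 9, 8, 8]))
```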
7. An efficient and long-lived quantum memory for quantum repeater
Science Foundation in China (CAS), 2016, Issue 4, p. 39 (1 page).
A research team led by Prof. Pan Jianwei (潘建伟) and Prof. Bao Xiaohui (包小辉) at the University of Science and Technology of China reported the successful realization of an efficient quantum light-matter interface with sub-second lifetime, which can be used as an elementary unit to extend the distance of quantum communication through a quantum repeater. This result was recently published in Nature.
Keywords: time, an efficient and long-lived quantum memory for quantum repeater, NATURE, Pan Jianwei (潘建伟)