As modern embedded systems become increasingly network-connected, their protocol stacks have become a frequently attacked surface. While C-based implementations such as LwIP are efficient, their lack of memory safety leads to critical vulnerabilities such as buffer overflows, dangling pointers, and use-after-free, which can result in remote code execution or privilege escalation. In this paper, we present LwRustIP, a memory-safe embedded networking stack reimplemented in Rust and compatible with LwIP. We also share our development experience. LwRustIP replaces unsafe linked-list memory management with a custom allocator that honors Rust's ownership semantics, leverages zero-copy techniques for inter-layer packet handoffs, and applies lock-free object pools for concurrent buffer management. These design choices ensure memory safety while maintaining performance comparable to traditional C-based implementations. We deploy LwRustIP on ARM-based embedded platforms and evaluate its correctness, performance, and memory safety. Experimental results show that LwRustIP achieves memory safety without incurring measurable performance overhead compared to the original C-based implementation. Our experience highlights the practical challenges and benefits of using Rust for low-level system components and offers guidance for future efforts in memory-safe reengineering of legacy C codebases.
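To make the ownership-based buffer handling described in this abstract more concrete, here is a minimal Rust sketch of a fixed-capacity packet-buffer pool whose buffers return to the pool automatically when dropped, and whose inter-layer handoff is a move rather than a copy. It is an illustration only, not LwRustIP's actual design or API: the names `Pool`, `PacketBuf`, `link_layer_rx`, and `ip_layer_input` are hypothetical, the buffer size is an assumption, and for brevity the free list is guarded by a `std::sync::Mutex` rather than being lock-free as the abstract describes.

```rust
use std::sync::{Arc, Mutex};

/// Assumed maximum frame size; the abstract does not specify one.
const BUF_SIZE: usize = 1536;

/// Fixed-capacity pool of preallocated buffers.
/// NOTE: a Mutex-guarded free list is used here for simplicity;
/// it is NOT the lock-free pool mentioned in the abstract.
struct Pool {
    free: Mutex<Vec<Box<[u8; BUF_SIZE]>>>,
}

impl Pool {
    /// Preallocate `capacity` zeroed buffers.
    fn new(capacity: usize) -> Arc<Pool> {
        let free = (0..capacity).map(|_| Box::new([0u8; BUF_SIZE])).collect();
        Arc::new(Pool { free: Mutex::new(free) })
    }

    /// Take one buffer out of the pool, or None if the pool is exhausted.
    fn alloc(pool: &Arc<Pool>) -> Option<PacketBuf> {
        let data = pool.free.lock().unwrap().pop()?;
        Some(PacketBuf { data: Some(data), len: 0, pool: Arc::clone(pool) })
    }

    /// Number of buffers currently available.
    fn available(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}

/// An owned packet buffer. Exactly one owner exists at any time, so the
/// compiler rejects use-after-free and double-free at compile time.
struct PacketBuf {
    data: Option<Box<[u8; BUF_SIZE]>>,
    len: usize,
    pool: Arc<Pool>,
}

impl PacketBuf {
    /// Copy `payload` into the buffer; returns false if it does not fit.
    fn fill(&mut self, payload: &[u8]) -> bool {
        if payload.len() > BUF_SIZE {
            return false;
        }
        let buf = self.data.as_mut().expect("buffer present until drop");
        buf[..payload.len()].copy_from_slice(payload);
        self.len = payload.len();
        true
    }

    /// Borrow the valid part of the buffer.
    fn payload(&self) -> &[u8] {
        &self.data.as_ref().expect("buffer present until drop")[..self.len]
    }
}

impl Drop for PacketBuf {
    /// Return the underlying storage to the pool when the owner is done.
    fn drop(&mut self) {
        if let (Some(buf), Ok(mut free)) = (self.data.take(), self.pool.free.lock()) {
            free.push(buf);
        }
    }
}

/// Hypothetical link layer: the single copy here stands in for the
/// driver writing a received frame into the buffer.
fn link_layer_rx(pool: &Arc<Pool>, frame: &[u8]) -> Option<PacketBuf> {
    let mut pkt = Pool::alloc(pool)?;
    if pkt.fill(frame) { Some(pkt) } else { None }
}

/// Hypothetical IP layer: takes ownership of the packet by move (no copy).
/// The buffer goes back to the pool when `pkt` is dropped at the end.
fn ip_layer_input(pkt: PacketBuf) -> usize {
    pkt.payload().len()
}
```

In this sketch, passing a `PacketBuf` to `ip_layer_input` transfers ownership, so the link layer can no longer touch the buffer afterwards, and the `Drop` implementation recycles the storage without any explicit free call.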
Transformer models have become a cornerstone of various natural language processing (NLP) tasks. However, the substantial computational overhead during inference remains a significant challenge, limiting their deployment in practical applications. In this study, we address this challenge by minimizing the inference overhead of transformer models using the control element of artificial intelligence (AI) accelerators. Our work is anchored by four key contributions. First, we conduct a comprehensive analysis of the overhead composition within the transformer inference process, identifying the primary bottlenecks. Second, we leverage the management processing element (MPE) of the Shenwei AI (SWAI) accelerator, implementing a three-tier scheduling framework that reduces the number of host-device launches to approximately 1/10,000 of that in the original PyTorch-GPU setup. Third, we introduce a zero-copy memory management technique using segment-page fusion, which significantly reduces memory access latency and improves overall inference efficiency. Finally, we develop a fast model loading method that eliminates redundant computations during model verification and initialization, reducing the total loading time for large models from 22128.31 ms to 1041.72 ms. Together, these contributions enable more efficient and faster transformer inference on AI accelerators.
Funding: supported by the National Key Research and Development Program of China (2022YFB4502001), the National Natural Science Foundation of China (62402291, 62302265, U23A20332), and the Shandong Province Natural Science Foundation (ZR2023QF172, 2024HWYQ-020).