Wireless communication-enabled Cooperative Adaptive Cruise Control (CACC) is expected to improve the safety and traffic capacity of vehicle platoons. Existing CACC considers a conventional communication delay with fixed Vehicular Communication Network (VCN) topologies. However, when the network is under attack, the communication delay may be much higher, and the stability of the system may not be guaranteed. This paper proposes a novel communication Delay Aware CACC with Dynamic Network Topologies (DADNT). The main idea is that, for various communication delays, in order to maximize the traffic capacity while guaranteeing stability and minimizing the following error, the CACC should dynamically adjust the VCN topology to achieve the minimum inter-vehicle spacing. To this end, a multi-objective optimization problem is formulated, and a 3-step Divide-And-Conquer sub-optimal solution (3DAC) is proposed. Simulation results show that with 3DAC, the proposed DADNT can reduce the inter-vehicle spacing by 5%, 10%, and 14%, respectively, compared with the traditional CACC with fixed one-vehicle, two-vehicle, and three-vehicle look-ahead network topologies, thereby improving traffic efficiency.
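The selection step of the idea above can be sketched in a few lines: for the currently observed delay, pick the look-ahead topology whose minimum stable spacing is smallest. The spacing model below is a made-up placeholder (the paper's actual stability analysis is not given in the abstract); the constants, the per-link overhead term, and the function shape are all assumptions for illustration.

```python
# Illustrative sketch: choose the look-ahead topology that minimizes
# the stable inter-vehicle spacing for a measured communication delay.
# The spacing model is a hypothetical placeholder, not the paper's.

def min_spacing(topology_k: int, delay_s: float) -> float:
    """Hypothetical minimum stable spacing (m) for a k-vehicle
    look-ahead topology under a given communication delay."""
    base = 5.0                                    # assumed standstill spacing (m)
    # Assumption: more look-ahead links tolerate delay better but add
    # per-link channel load, modeled as extra effective delay.
    effective_delay = delay_s * (1.0 + 0.2 * topology_k)
    link_overhead = 0.5 * topology_k              # assumed bandwidth cost per link
    return base + 30.0 * effective_delay / topology_k + link_overhead

def select_topology(delay_s: float, topologies=(1, 2, 3)) -> int:
    """Pick the topology with the smallest stable spacing."""
    return min(topologies, key=lambda k: min_spacing(k, delay_s))
```

Under this toy model, a short delay favors the cheap one-vehicle topology, while a long (e.g., attack-induced) delay makes the three-vehicle look-ahead worthwhile, which mirrors the dynamic-adjustment idea of DADNT.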
The advent of Grover’s algorithm presents a significant threat to classical block cipher security, spurring research into post-quantum secure cipher design. This study engineers quantum circuit implementations for three versions of the Ballet family of block ciphers. Ballet-p/k includes a modular-addition operation that is uncommon in lightweight block ciphers; a quantum ripple-carry adder is implemented at both the “32+32” and “64+64” scales to support this operation. Subsequently, the qubit count, quantum gate count, and quantum circuit depth of the three versions of the Ballet algorithm are systematically evaluated under the quantum computing model, and key-recovery attack circuits are constructed based on Grover’s algorithm against each version. The comprehensive analysis shows that Ballet-128/128 fails to meet NIST Level 1 security, while, when the resource accounting is restricted to the Clifford and T gate set, the Ballet-128/256 and Ballet-256/256 quantum circuits attain Level 3.
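The iteration count underlying such resource estimates is standard Grover theory: searching a keyspace of size 2^k takes about ⌊(π/4)·2^(k/2)⌋ oracle iterations. A minimal sketch for the Ballet key lengths (the per-oracle gate counts are the paper's contribution and are not reproduced here):

```python
import math

def grover_iterations(key_bits: int) -> int:
    """Grover iterations to search a 2**key_bits keyspace:
    floor((pi/4) * 2**(key_bits/2))."""
    return math.floor((math.pi / 4) * 2 ** (key_bits / 2))

# Key lengths of the Ballet versions discussed above.
for k in (128, 256):
    iters = grover_iterations(k)
    # Total attack cost ~ iterations x (gates per cipher oracle),
    # which is what the resource evaluation above accounts for.
    print(f"{k}-bit key: ~2^{iters.bit_length() - 1} iterations")
```

This is why a 128-bit key yields roughly 2^63 iterations: the quadratic speedup halves the exponent, and the Clifford+T accounting then multiplies by the per-iteration circuit cost.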
LEO satellite communication systems are characterized by high-speed, periodic movement. User-link handovers occur frequently, which seriously impacts user terminal applications and system capacity. To address this issue, we propose a handover strategy for LEO satellite user terminals based on multi-attribute and multi-point (MAMP) cooperation. Firstly, the satellite-user-time matrix is established using the satellite constellation coverage and handover model. Then, combined with the visible time and signal quality, the user access matrix and satellite load matrix are extracted to determine the weight equation of the handover strategy with channel reservation. According to the system modeling simulation, the algorithm improves the handover success rate by 2.5%, the call access success rate by 3.2%, the load balancing degree by 20%, and the robustness by two orders of magnitude.
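The weight-equation idea — score each visible satellite by a weighted combination of its attributes and hand over to the best-scoring one — can be sketched as follows. The attribute set, the normalization, and the weights here are assumptions for illustration, not the paper's exact equation.

```python
# Sketch of multi-attribute handover scoring: rate each candidate
# satellite by a weighted sum of normalized visible time, signal
# quality, and remaining load. Weights are illustrative assumptions.

def handover_score(visible_time, signal_quality, load,
                   weights=(0.4, 0.4, 0.2)):
    """All attributes normalized to [0, 1]; higher score is better.
    `load` is occupancy, so remaining capacity is (1 - load)."""
    w_t, w_q, w_l = weights
    return w_t * visible_time + w_q * signal_quality + w_l * (1.0 - load)

def pick_satellite(candidates):
    """candidates: dict name -> (visible_time, signal_quality, load)."""
    return max(candidates, key=lambda s: handover_score(*candidates[s]))

sats = {
    "sat_a": (0.9, 0.6, 0.8),   # long visibility, but heavily loaded
    "sat_b": (0.5, 0.9, 0.2),   # strong signal, lightly loaded
}
```

With these weights the lightly loaded satellite wins despite its shorter visibility window, which is the load-balancing behavior the strategy above is after.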
With the increasing demand for computational power in artificial intelligence (AI) algorithms, dedicated accelerators have become a necessity. However, the complexity of hardware architectures, the vast design search space, and the complex tasks of accelerators pose significant challenges. Traditional search methods can become prohibitively slow as the search space continues to expand. A design space exploration (DSE) method based on transfer learning is proposed, which reduces the time spent on repeated training and uses multi-task models for different tasks on the same processor. The proposed method accurately predicts the latency and energy consumption associated with neural-network accelerator design parameters, enabling faster identification of optimal outcomes than traditional methods. Compared with other DSE methods using a multilayer perceptron (MLP), the required training time is also shorter. Comparative experiments with other methods demonstrate that the proposed method improves the efficiency of DSE without compromising the accuracy of the results.
The Sequential Task Flow (STF) model guides task parallelism by dynamically analyzing data dependencies at runtime, making it well-suited to handle dynamic and irregular parallelism. However, it introduces additional dependency-tracking overhead. As task granularity becomes increasingly fine-grained or hardware parallelism increases, the traditional Centralized TDG Building (CB) algorithm progressively becomes a performance bottleneck. The Parallel TDG Building algorithm with Helpers (PBH), which leverages hardware message-passing mechanisms, has achieved significant speedups on the SW26010 platform, but its intensive sub-microsecond irregular synchronizations make it difficult to scale on cache-coherent multicore platforms. This paper proposes Cache-friendly PBH (CPBH), a parallel dependency-tracking algorithm optimized for cache-coherent architectures. CPBH introduces a locality-aware lock-free batch synchronization mechanism that reduces the overhead of atomic-operation contention and improves data access locality. Additionally, it employs an asynchronous execution strategy to overlap dependency tracking and task graph execution using dynamic reference counting. Experiments on three cache-coherent multicore platforms using 10 HPC benchmarks demonstrate that CPBH achieves an average speedup exceeding 1.4× compared to CB and over 1.2× compared to DDAST under fine-grained scenarios.
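The reference-counting mechanism mentioned above is the classic task-graph pattern: each task stores how many predecessors are still unfinished, and finishing a task decrements its successors' counts, releasing those that hit zero. A minimal sequential sketch of that semantics (not CPBH's lock-free batched version):

```python
from collections import defaultdict

# Minimal sketch of reference-counting dependency release in an
# STF-style task graph. Sequential toy model for the semantics only;
# the paper's contribution is doing this concurrently and cheaply.

class TaskGraph:
    def __init__(self):
        self.refcount = defaultdict(int)   # task -> unfinished predecessors
        self.succs = defaultdict(list)     # task -> successor tasks

    def add_edge(self, pred, succ):
        self.succs[pred].append(succ)
        self.refcount[succ] += 1

    def ready(self, tasks):
        """Tasks with no unfinished predecessors."""
        return [t for t in tasks if self.refcount[t] == 0]

    def finish(self, task):
        """Mark `task` done; return successors that just became ready."""
        released = []
        for s in self.succs[task]:
            self.refcount[s] -= 1
            if self.refcount[s] == 0:
                released.append(s)
        return released
```

In a parallel runtime the decrement and the release list are the contended hot path, which is exactly what CPBH's batch synchronization targets.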
1. Introduction
The rapid expansion of satellite constellations in recent years has resulted in the generation of massive amounts of data. This surge in data, coupled with diverse application scenarios, underscores the escalating demand for high-performance computing over space. Computing over space entails the deployment of computational resources on platforms such as satellites to process large-scale data under constraints such as high radiation exposure, restricted power consumption, and minimized weight.
In covert communications, joint jammer selection and power optimization are important for improving performance. However, existing schemes usually assume a warden with a known location and perfect Channel State Information (CSI), which is difficult to achieve in practice. To be more practical, it is important to investigate covert communications against a warden with an uncertain location and imperfect CSI, which makes it difficult for legitimate transceivers to estimate the detection probability of the warden. First, the uncertainty caused by the unknown warden location must be removed, so the Optimal Detection Position (OPTDP) of the warden is derived, which provides the best detection performance (i.e., the worst case for covert communication). Then, to further avoid the impractical assumption of perfect CSI, the covert throughput is maximized using only the channel distribution information. Given this OPTDP-based worst case, the jammer selection, the jamming power, the transmission power, and the transmission rate are jointly optimized to maximize the covert throughput (OPTDP-JP). To solve this coupled problem, a Heuristic algorithm based on the Maximum Distance Ratio (H-MAXDR) is proposed to provide a sub-optimal solution. First, according to the analysis of the covert throughput, the node with the maximum distance ratio (i.e., the ratio of its distance to the receiver over its distance to the warden) is selected as the friendly jammer (MAXDR). Then, the optimal transmission and jamming power are derived, followed by the optimal transmission rate obtained via the bisection method. Numerical and simulation results show that although the location of the warden is unknown, by assuming the OPTDP of the warden, the proposed OPTDP-JP can always satisfy the covertness constraint. In addition, with an uncertain warden and imperfect CSI, the covert throughput provided by OPTDP-JP is 80% higher than that of existing schemes when the covertness constraint is 0.9, showing the effectiveness of OPTDP-JP.
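The bisection step used above for the transmission rate is the standard root-bracketing routine on a monotone feasibility function. A generic sketch (the sample function is a placeholder, not the paper's covert-throughput condition):

```python
# Generic bisection: find the root of a function that changes sign
# exactly once on [lo, hi]. Used above to pin down the optimal
# transmission rate once powers and the jammer are fixed.

def bisect_root(f, lo, hi, tol=1e-9):
    """Assumes f(lo) and f(hi) have opposite signs."""
    assert f(lo) * f(hi) <= 0, "root must be bracketed"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid          # root lies in the left half
        else:
            lo = mid          # root lies in the right half
    return 0.5 * (lo + hi)
```

Bisection halves the bracket each iteration, so it reaches tolerance `tol` in about log2((hi - lo)/tol) evaluations regardless of the function's shape, which is why it suits the one-dimensional rate search here.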
Synthetic aperture radar (SAR) radio frequency identification (RFID) localization is widely used for automated guided vehicles (AGVs) in the industrial Internet of Things (IIoT). However, the AGV’s speed is limited by the phase difference (PD) between two neighboring readers. In this paper, an inertial navigation system (INS) based SAR RFID localization method (ISRL) is proposed for scenarios where the AGV moves nonlinearly. To relax the speed limitation, a new phase-unwrapping method based on the similarity of PDs (PU-SPD) is proposed to deal with the PD ambiguity when the AGV speed exceeds 60 km/h. In localization, the Gauss-Newton algorithm (GN) is employed, and an initial value estimation scheme based on variable substitution (IVE-VS) is proposed to improve its positioning accuracy and convergence rate. Thus, ISRL is a combination of IVE-VS and GN. Moreover, the Cramer-Rao lower bound (CRLB) and the speed limitation are derived. Simulation results show that ISRL converges after two iterations, and the positioning accuracy reaches 7.50 cm at a phase noise level σ=0.18, which is 35% better than the hyperbolic unbiased estimation localization (HyUnb).
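For context, the classical phase-unwrapping baseline adds multiples of 2π so that consecutive samples differ by less than π; the ambiguity PU-SPD addresses arises when true inter-sample jumps exceed π (fast AGVs), where this baseline fails. A sketch of the baseline only — PU-SPD's similarity criterion is not reproduced here:

```python
import math

# Textbook 1-D phase unwrapping: shift each sample by a multiple of
# 2*pi so consecutive differences stay within (-pi, pi]. This is the
# baseline that breaks at high AGV speeds, motivating PU-SPD above.

def unwrap(phases):
    out = [phases[0]]
    for p in phases[1:]:
        d = p - out[-1]
        # fold the apparent jump back into (-pi, pi]
        d -= 2 * math.pi * round(d / (2 * math.pi))
        out.append(out[-1] + d)
    return out
```

For example, a wrapped sequence [0, 3, -3] is unwrapped to [0, 3, 2π−3], recovering the smooth underlying trajectory.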
The shadow tomography problem introduced by [1] is an important problem in quantum computing. Given an unknown n-qubit quantum state ρ, the goal is to estimate tr(F_1ρ), ..., tr(F_Mρ) using as few copies of ρ as possible, within an additive error of ε, where F_1, ..., F_M are known two-outcome measurements. In this paper, we consider the shadow tomography problem with a potentially inaccurate prediction σ of the true state ρ. This corresponds to practical cases where we possess prior knowledge of the unknown state. For example, in quantum verification or calibration, we may be aware of the quantum state that the quantum device is expected to generate; however, the state it actually generates may deviate. We introduce an algorithm with sample complexity Õ(n·max{d, ε}·log²M/ε⁴), where d is the trace distance between the prediction and the unknown state. In the generic case, even if the prediction is arbitrarily bad, our algorithm has the same complexity as the best algorithm without prediction [2]. At the same time, as the prediction quality improves, the sample complexity reduces smoothly to Õ(n log²M/ε³) when the trace distance between the prediction and the unknown state is O(ε). Furthermore, we conduct numerical experiments to validate our theoretical analysis. The experiments are constructed to simulate noisy quantum circuits that reflect possible real scenarios in quantum verification or calibration. Notably, our algorithm outperforms the previous work without prediction in most settings.
We have optimized the parallel threshold ILU algorithm (ParILUT) for GPUs. The optimizations target three building blocks: candidate search and ILU residual computation, adding and removing elements, and threshold selection. Firstly, we fuse candidate search and ILU residual computation by modifying the ParILUT algorithm and extending the register-aware SpGEMM algorithm to calculate it; we also develop a GPU bin-search algorithm to make the register-aware SpGEMM algorithm perform better in ParILUT. Secondly, we adopt a warp-row-parallel approach, instead of a thread-row-parallel approach, to add elements to the new L and U and remove elements from the candidates, and we use efficient GPU instructions to locate the positions of elements. Thirdly, we propose a balanced classification tree in the threshold selection to balance the buckets’ data when a large number of elements share the same value. Finally, we evaluate the performance of each optimization and of the whole ParILUT, and verify the correctness of the optimized ParILUT. The results indicate that the optimized ParILUT achieves an average speedup of 4.03× over the original version, and the speedup increases with the amount of fill-in.
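The threshold-selection building block has a simple reference semantics: given the candidate values, find the magnitude threshold that keeps exactly the m largest-magnitude entries and drop the rest. A sort-based sketch of that semantics (the GPU version above replaces the sort with bucket-based selection and the balanced classification tree):

```python
# Reference semantics of ParILUT-style threshold selection: keep the
# m largest-magnitude entries. Sort-based sketch for clarity only;
# the GPU implementation uses bucketed selection instead.

def select_threshold(values, m):
    """Return tau such that the m largest |values| satisfy |v| >= tau
    (ties broken arbitrarily, as in bucket-based selection)."""
    mags = sorted((abs(v) for v in values), reverse=True)
    return mags[m - 1]

def drop_small(values, m):
    tau = select_threshold(values, m)
    kept = [v for v in values if abs(v) >= tau]
    return kept[:m]   # trim ties to exactly m entries
```

The balanced classification tree mentioned above addresses the degenerate case this sketch glosses over: when many elements share the threshold value, naive buckets become badly skewed.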
Low-Earth-Orbit satellite constellation networks (LEO-SCN) can provide low-cost, large-scale, flexible-coverage wireless communication services. LEO-SCN are characterized by high dynamics and large topological size. Protocol development and application testing for LEO-SCN are challenging to carry out in a natural environment, so simulation platforms are a more effective means of technology demonstration. Currently available simulators have a single function and limited simulation scale; a full-featured simulator is needed. In this paper, we apply the parallel discrete-event simulation technique to the simulation of LEO-SCN to support large-scale complex system simulation at the packet level. To solve the problem that single-process programs cannot cope with complex simulations containing numerous entities, we propose a parallel mechanism and the synchronization algorithms LP-NM and LP-YAWNS. In the experiments, we use ns-3 to verify the speedup ratio and efficiency of these algorithms. The results show that the proposed mechanism can provide parallel simulation engine support for LEO-SCN.
A novel quantum algorithm for the Mastermind game was proposed recently by a research team from Sun Yat-sen University to highlight the power of quantum computing. Mastermind is a popular code-breaking game between a codemaker and a codebreaker. In the commercial version, the codemaker selects a secret sequence of four colored pegs (positions) from six possible colors.
With the rapid advancement of artificial intelligence and high-performance computing, heterogeneous computing platforms have evolved to encompass increasingly diverse architectures. While SYCL, an open standard for heterogeneous programming, has gained widespread adoption, its mainstream implementations (such as DPC++ and AdaptiveCpp) primarily target SIMT-architecture devices like GPUs, presenting substantial challenges when adapting to specialized accelerators such as the Cambricon MLU, which employs a fundamentally different SIMD execution model. This cross-programming-model extension encounters two critical challenges: (1) bridging the programming abstraction gap between SIMT’s thread-level parallelism and SIMD’s data-level parallelism; and (2) harmonizing SYCL’s unified memory model with device-specific memory architectures. This paper proposes a novel cross-programming-model SYCL extension methodology to achieve full SYCL support for SIMD architectures, demonstrated through a comprehensive implementation for the Cambricon MLU platform. Our approach introduces MLU-specific vector programming interfaces while maintaining compatibility with the SYCL standard, enabling seamless integration of SIMD-based accelerators into the SYCL ecosystem. To validate our methodology, we integrated the extended SYCL-MLU implementation into PaddlePaddle’s CINN compiler, achieving a geometric mean performance improvement of 9.14% across representative neural networks, including ResNet, YOLOv3, and BERT. This research significantly broadens the application scope of SYCL in heterogeneous programming and provides a systematic methodology for extending SYCL to other SIMD-based hardware platforms.
Despite advancements in computer hardware, the performance of GROMACS simulations has not exhibited significant improvement, primarily due to inefficient utilization of substantial hardware resources. Resource utilization can be improved through effective scheduling when running multiple simulations concurrently on a single computing node, particularly benefiting the frequently employed small-scale system simulations. Previous research focused on co-running multiple GROMACS simulations through time-slice technology. However, this approach introduced notable context-switching overhead and predominantly concentrated on optimizing GPU resource utilization, while neglecting the collaborative scheduling of heterogeneous CPU and GPU devices. Nowadays, various GPU vendors have introduced hardware partitioning technologies for spatial resource allocation, complementing traditional time-sharing techniques. Moreover, GROMACS operates as a heterogeneous computing application, alternating computations between the CPU and GPU; notably, GPU utilization sometimes accounts for as little as 35%. Consequently, a comprehensive approach involving coordinated scheduling of both the GPU and CPU is imperative. To leverage the potential of hardware partitioning technologies in alignment with GROMACS’ runtime characteristics, we propose FILL, a resource scheduling system designed for co-running multiple GROMACS jobs. FILL employs space-partitioning technology to effectively allocate hardware resources and facilitates collaborative scheduling of CPU and GPU resources, thereby ensuring precise and deterministic allocation of GROMACS job resources. The scheduling aims to improve system throughput while considering the turnaround time of simulations. Implemented on servers equipped with NVIDIA and AMD GPUs, FILL has showcased noteworthy advancements in system throughput. On NVIDIA GPU servers, FILL achieved an improvement of up to 167% compared to the baseline approach and of 27,928% compared to state-of-the-art alternatives. Similarly, on AMD GPU servers, FILL demonstrated enhancements of 459% and 24% over the baseline and state-of-the-art methods, respectively. These results validate the effectiveness of FILL in optimizing system throughput for multiple GROMACS simulations.
The deep potential (DP) scheme has increased the temporal and spatial scales of simulation while maintaining the ab initio accuracy of molecular dynamics. DeePMD-kit is an outstanding application that implements the DP scheme efficiently. However, current performance models cannot accurately measure the resource utilization of DeePMD-kit operators or predict the execution time. We introduce DP-perf, an interpretable performance model for DeePMD-kit. DP-perf accurately measures the resource utilization of individual DeePMD-kit operators, the communication pattern, and the overall application by exploiting physical system properties and machine configurations. It can be easily applied to mainstream supercomputers including Tianhe-3F, the new Sunway, Fugaku, and Summit. With DP-perf, users can select the optimal machine and decide the corresponding configuration for various purposes (e.g., lower cost, less time) without real runs. Evaluation on four top supercomputers shows that DP-perf fits the overall execution time with a low mean absolute percentage error of 5.7%/8.1%/14.3%/13.1% on Tianhe-3F/new Sunway/Fugaku/Summit. In the prediction scenario, DP-perf predicts the total execution time with a mean absolute percentage error of less than 20%.
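The fit metric quoted above, mean absolute percentage error (MAPE), is the mean over samples of |predicted − actual| / |actual|, expressed as a percentage:

```python
# Mean absolute percentage error (MAPE), the fit/prediction metric
# reported above: 100 * mean(|predicted - actual| / |actual|).

def mape(actual, predicted):
    assert len(actual) == len(predicted) and actual
    return 100.0 * sum(abs(p - a) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)
```

A MAPE of 5.7% thus means DP-perf's fitted execution times deviate from the measured ones by 5.7% on average, relative to each measured value.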
Nowadays, with the increasing depth of CNNs, the computation and storage requirements of weights expand significantly, preventing wide deployment in resource-constrained application scenarios such as embedded systems. To improve the efficiency of the deep CNN inference stage, researchers have explored weight pruning techniques on CNN accelerators (e.g., systolic arrays) to avoid storing and computing unimportant weights. However, these attempts either suffer expensive extra hardware costs to encode/decode the irregular sparse weight pattern on accelerators or bring limited performance improvement due to structured pruning’s modest compression ratio. To address this challenge, this paper proposes FASS-Pruner, a Fine-grained Accelerator-aware pruning framework via intra-filter Splitting and inter-filter Shuffling: (1) considering the round-by-round execution behavior of CNN accelerators, FASS-Pruner splits filters into multiple rounds to perform column-wise weight pruning; (2) leveraging the calculation-independence characteristics across filters on CNN accelerators, FASS-Pruner shuffles the filters to prune the unimportant row-wise weights. Combining the sparse pattern of the pruned CNN and the dataflow of the systolic array, we modify the systolic-array-based accelerator to execute the pruned sparse CNN with better performance and lower energy consumption. By condensing the pruned sparse weights in systolic arrays, FASS-Pruner achieves a comparable pruning ratio while preserving the original dataflow of CNN accelerators, thereby achieving significant performance and energy savings.
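The column-wise pruning idea can be illustrated on a plain weight matrix: score each column by a magnitude norm and zero out the lowest-scoring ones. This sketch uses L1 scoring purely for illustration; FASS-Pruner's intra-filter splitting and inter-filter shuffling impose additional accelerator-aware structure not shown here.

```python
# Sketch of column-wise magnitude pruning: score each column of a
# weight matrix by its L1 norm and zero the lowest-scoring columns.
# Illustrative only; the importance criterion is an assumption.

def prune_columns(weights, keep):
    """weights: list of rows (lists of floats); keep: #columns to keep.
    Returns a new matrix with pruned columns set to 0.0."""
    ncols = len(weights[0])
    col_l1 = [sum(abs(row[c]) for row in weights) for c in range(ncols)]
    kept = set(sorted(range(ncols), key=lambda c: -col_l1[c])[:keep])
    return [[row[c] if c in kept else 0.0 for c in range(ncols)]
            for row in weights]
```

Because whole columns are zeroed, the surviving weights keep a regular layout, which is what lets a systolic array skip them without per-element index decoding.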
The high failure rates in clinical drug development based on animal models highlight the urgent need for more representative human models in biomedical research. In response to this demand, organoids and organ chips have been integrated for greater physiological relevance and dynamic, controlled experimental conditions. This innovative platform—organoids-on-a-chip technology—shows great promise in disease modeling, drug discovery, and personalized medicine, attracting interest from researchers, clinicians, regulatory authorities, and industry stakeholders. This review traces the evolution from organoids to organoids-on-a-chip, driven by the necessity for advanced biological models. We summarize the applications of organoids-on-a-chip in simulating physiological and pathological phenotypes and in therapeutic evaluation. This section highlights how integrating technologies from organ chips, such as microfluidic systems, mechanical stimulation, and sensor integration, optimizes organoid cell types, spatial structure, and physiological functions, thereby expanding their biomedical applications. We conclude by addressing the current challenges in the development of organoids-on-a-chip and offering insights into its prospects. The advancement of organoids-on-a-chip is poised to enhance fidelity, standardization, and scalability. Furthermore, the integration of cutting-edge technologies and interdisciplinary collaborations will be crucial for the progression of organoids-on-a-chip technology.
A novel quantum search algorithm tailored for continuous optimization and spectral problems was proposed recently by a research team from the University of Electronic Science and Technology of China to broaden quantum computation frontiers and enrich its application landscape. Quantum computing has traditionally excelled at tackling discrete search challenges, but many important applications, from large-scale optimization to advanced physics simulations, necessitate searching through continuous domains. These continuous search problems involve uncountably infinite solution spaces and bring about computational complexities far beyond those faced in conventional discrete settings. This draft, titled “Fixed-Point Quantum Continuous Search Algorithm with Optimal Query Complexity”, takes on the core challenge of performing search tasks in domains that may be uncountably infinite, offering theoretical and practical insights into achieving quantum speedups in such settings [1].
In the field of natural language processing, the rapid development of large language models (LLMs) has attracted increasing attention. LLMs have shown a high level of creativity in various tasks, but the methods for assessing such creativity are inadequate. Assessment of LLM creativity needs to consider differences from humans, requiring measurement along multiple dimensions while balancing accuracy and efficiency. This paper aims to establish an efficient framework for assessing the level of creativity in LLMs. By adapting the modified Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks, emphasizing 4 criteria: fluency, flexibility, originality, and elaboration. In this context, we develop a comprehensive dataset of 700 questions for testing and an LLM-based evaluation method. This study also presents a novel analysis of LLMs’ responses to diverse prompts and role-play situations. We found that the creativity of LLMs primarily falls short in originality, while excelling in elaboration, and that the prompts and role-play settings of the model significantly influence creativity. The experimental results further indicate that collaboration among multiple LLMs can enhance originality. Notably, our findings reveal a consensus between human evaluations and LLMs regarding the personality traits that influence creativity. The findings underscore the significant impact of LLM design on creativity and bridge artificial intelligence and human creativity, offering insights into LLMs’ creativity and potential applications.
Submodular maximization is a significant area of interest in combinatorial optimization, with various real-world applications. In recent years, streaming algorithms for submodular maximization have gained attention, allowing real-time processing of large data sets by examining each piece of data only once. However, most of the current state-of-the-art algorithms are only applicable to monotone submodular maximization, and there are still significant gaps in the approximation ratios between monotone and non-monotone objective functions. In this paper, we propose a streaming algorithm framework for non-monotone submodular maximization and use this framework to design deterministic streaming algorithms for the d-knapsack constraint and the knapsack constraint. Our 1-pass streaming algorithm for the d-knapsack constraint has a (1/(4(d+1)) − ε) approximation ratio, using O(B log B/ε) memory and O(log B/ε) query time per element, where B = min(n, b) is the maximum number of elements that the knapsack can store. As a special case of the d-knapsack constraint, we obtain a 1-pass streaming algorithm with a (1/8 − ε) approximation ratio for the knapsack constraint. To our knowledge, there is currently no streaming algorithm for this constraint when the objective function is non-monotone, even when d = 1. In addition, we propose a multi-pass streaming algorithm with (1/6 − ε) approximation, which stores O(B) elements.
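The single-pass mechanism common to such algorithms can be sketched as a density-threshold rule: keep an arriving element iff its marginal gain per unit cost clears a threshold and it fits the remaining budget. This illustrates only the one-pass skeleton; the approximation guarantees above rely on additional bookkeeping (threshold guessing, handling non-monotonicity) not shown here, and the toy coverage function is an assumption.

```python
# One-pass density-threshold sketch for knapsack-constrained
# submodular maximization: each element is examined exactly once.

def stream_knapsack(stream, f, budget, tau):
    """stream: iterable of (element, cost); f: set -> value
    (monotone submodular here, for simplicity of the toy example)."""
    S, used = set(), 0.0
    for e, cost in stream:
        gain = f(S | {e}) - f(S)          # marginal gain of e
        if used + cost <= budget and gain >= tau * cost:
            S.add(e)
            used += cost
    return S

# Toy coverage function: value = size of the union of covered items.
cover = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
f = lambda S: len(set().union(*(cover[e] for e in S)) if S else set())
```

Each element triggers O(1) value queries and the memory holds only the current solution, which is the regime the O(B log B/ε)-memory analysis above refines by running several thresholds in parallel.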
Funding: supported by the National Natural Science Foundation of China under Grant U21A20449, and in part by the Jiangsu Provincial Key Research and Development Program under Grant BE2021013-2.
Funding: State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences (CLQ202516); the Fundamental Research Funds for the Central Universities of China (3282025047, 3282024051, 3282024009).
Funding: Supported by the Innovation Funding of ICT, CAS under Grant No. E261020; the Jiangsu Key Research and Development Program of China (No. BE2021013-2); the Zhejiang Key Research and Development Program (No. 2021C01040).
Abstract: LEO satellite communication systems are characterized by high-speed, periodic movement. User-link handovers occur frequently, which seriously affects user terminal applications and system capacity. To address this issue, we propose a handover strategy for LEO satellite user terminals based on Multi-Attribute and Multi-Point (MAMP) cooperation. First, a satellite-user-time matrix is established using the satellite constellation coverage and handover model. Then, combining visible time and signal quality, the user access matrix and satellite load matrix are extracted to determine the weight equation of the handover strategy with channel reservation. System modeling and simulation show that the algorithm improves the handover success rate by 2.5%, the call access success rate by 3.2%, the load balancing degree by 20%, and the robustness by two orders of magnitude.
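A minimal sketch of a multi-attribute handover score in the spirit of MAMP; the weight values, the 600 s visibility normalization, and the attribute set are illustrative assumptions, not the paper's weight equation.

```python
# Hypothetical weighted handover score combining visible time, signal
# quality, and satellite load. Weights and normalization are assumptions.
def handover_score(visible_time_s, signal_quality, load, w=(0.4, 0.4, 0.2)):
    """Higher is better; load in [0, 1] penalizes congested satellites."""
    return w[0] * visible_time_s / 600.0 + w[1] * signal_quality - w[2] * load

def best_satellite(candidates):
    """candidates: dict name -> (visible_time_s, signal_quality, load)."""
    return max(candidates, key=lambda k: handover_score(*candidates[k]))
```

With equal visibility, a lightly loaded satellite with slightly weaker signal can outrank a congested one, which is the load-balancing behavior the abstract reports.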
Funding: the National Key R&D Program of China (No. 2018AAA0103300); the National Natural Science Foundation of China (Nos. 61925208, U20A20227, U22A2028); the Chinese Academy of Sciences Project for Young Scientists in Basic Research (No. YSBR-029); the Youth Innovation Promotion Association, Chinese Academy of Sciences.
Abstract: With the increasing computational demands of artificial intelligence (AI) algorithms, dedicated accelerators have become a necessity. However, the complexity of hardware architectures, the vast design search space, and the complex tasks of accelerators pose significant challenges. Traditional search methods can become prohibitively slow as the search space expands. A design space exploration (DSE) method based on transfer learning is proposed, which reduces the time spent on repeated training and uses multi-task models for different tasks on the same processor. The proposed method accurately predicts the latency and energy consumption associated with neural network accelerator design parameters, enabling faster identification of optimal outcomes than traditional methods, and requires shorter training time than other DSE methods based on multilayer perceptrons (MLPs). Comparative experiments demonstrate that the proposed method improves the efficiency of DSE without compromising the accuracy of the results.
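As a rough illustration of surrogate-based DSE, the sketch below fits a predictor from accelerator design parameters to latency and then ranks candidate designs without simulating each one. Ordinary least squares on synthetic data stands in for the paper's transfer-learned MLP; the features and latency model are invented for the example.

```python
import numpy as np

# Surrogate-model sketch for DSE: learn latency from design parameters
# (here: PE count, buffer size), then query the model instead of running
# slow simulations. Data and the latency formula are synthetic.
rng = np.random.default_rng(0)
X = rng.uniform(1, 64, size=(200, 2))            # [PE count, buffer KB]
latency = 1000.0 / X[:, 0] + 5.0 / X[:, 1]       # synthetic ground truth

A = np.column_stack([1.0 / X[:, 0], 1.0 / X[:, 1]])   # features 1/PE, 1/buf
coef, *_ = np.linalg.lstsq(A, latency, rcond=None)

def predict_latency(pe, buf):
    return coef[0] / pe + coef[1] / buf
```

Once fitted, evaluating `predict_latency` over thousands of candidate configurations is essentially free, which is the speedup DSE surrogates provide over exhaustive simulation.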
Funding: Supported in part by the National Key Research and Development Program of China (2024YFB4505701) and the National Natural Science Foundation of China (62090024).
Abstract: The Sequential Task Flow (STF) model guides task parallelism by dynamically analyzing data dependencies at runtime, making it well suited to dynamic and irregular parallelism. However, it introduces additional dependency-tracking overhead. As task granularity becomes increasingly fine-grained or hardware parallelism increases, the traditional Centralized TDG Building (CB) algorithm progressively becomes a performance bottleneck. The Parallel TDG Building algorithm with Helpers (PBH), which leverages hardware message-passing mechanisms, has achieved significant speedups on the SW26010 platform, but its intensive sub-microsecond irregular synchronizations make it difficult to scale on cache-coherent multicore platforms. This paper proposes Cache-friendly PBH (CPBH), a parallel dependency-tracking algorithm optimized for cache-coherent architectures. CPBH introduces a locality-aware lock-free batch synchronization mechanism that reduces atomic-operation contention and improves data access locality. Additionally, it employs an asynchronous execution strategy that overlaps dependency tracking and task graph execution using dynamic reference counting. Experiments on three cache-coherent multicore platforms with 10 HPC benchmarks demonstrate that CPBH achieves an average speedup exceeding 1.4× over CB and over 1.2× over DDAST in fine-grained scenarios.
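The dependency tracking an STF runtime performs can be sketched in miniature: infer edges from each task's declared read/write sets, then release tasks by reference counting once all predecessors finish. This toy is sequential and ignores CPBH's batching and cache-locality machinery.

```python
# Toy sequential-task-flow sketch: build a task dependency graph from
# read/write sets (RAW, WAR, WAW hazards) and execute via ref counting.
from collections import defaultdict

def build_and_run(tasks):
    """tasks: list of (name, reads, writes). Returns a valid execution order."""
    deps = defaultdict(set)                # task index -> predecessor indices
    last_writer, readers = {}, defaultdict(list)
    for i, (_, reads, writes) in enumerate(tasks):
        for d in reads:                    # read-after-write dependency
            if d in last_writer:
                deps[i].add(last_writer[d])
        for d in writes:                   # write-after-write / write-after-read
            if d in last_writer:
                deps[i].add(last_writer[d])
            for r in readers[d]:
                deps[i].add(r)
            last_writer[d], readers[d] = i, []
        for d in reads:
            readers[d].append(i)
    deps = {i: {p for p in deps[i] if p != i} for i in range(len(tasks))}
    # reference-counted release: run any task whose predecessors all finished
    order, done = [], set()
    while len(done) < len(tasks):
        ready = [i for i in range(len(tasks)) if i not in done and deps[i] <= done]
        i = ready[0]
        done.add(i)
        order.append(tasks[i][0])
    return order
```

A real runtime would execute the `ready` set concurrently; the centralized-vs-parallel question the paper studies is about who builds and updates `deps` under contention.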
Funding: Supported in part by the National Natural Science Foundation of China (62025404); in part by the National Key Research and Development Program of China (2022YFB3902802); in part by the Beijing Natural Science Foundation (L241013); and in part by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA000000).
Abstract: 1. Introduction. The rapid expansion of satellite constellations in recent years has resulted in the generation of massive amounts of data. This surge in data, coupled with diverse application scenarios, underscores the escalating demand for high-performance computing over space. Computing over space entails deploying computational resources on platforms such as satellites to process large-scale data under constraints such as high radiation exposure, restricted power consumption, and minimized weight.
Funding: Supported by the CAS Project for Young Scientists in Basic Research under Grant YSBR-035 and the Jiangsu Provincial Key Research and Development Program under Grant BE2021013-2.
Abstract: In covert communications, joint jammer selection and power optimization are important for improving performance. However, existing schemes usually assume a warden with a known location and perfect Channel State Information (CSI), which is difficult to achieve in practice. To be more practical, it is important to investigate covert communications against a warden with an uncertain location and imperfect CSI, which makes it difficult for legitimate transceivers to estimate the warden's detection probability. First, the uncertainty caused by the unknown warden location must be removed, so the Optimal Detection Position (OPTDP) of the warden, which yields the best detection performance (i.e., the worst case for covert communication), is derived. Then, to avoid the impractical assumption of perfect CSI, the covert throughput is maximized using only channel distribution information. Given this OPTDP-based worst case, the jammer selection, jamming power, transmission power, and transmission rate are jointly optimized to maximize the covert throughput (OPTDP-JP). To solve this coupled problem, a Heuristic algorithm based on the Maximum Distance Ratio (H-MAXDR) is proposed to provide a sub-optimal solution. First, based on the analysis of the covert throughput, the node with the maximum distance ratio (i.e., the ratio of the distances from the jammer to the receiver and from the jammer to the warden) is selected as the friendly jammer (MAXDR). Then, the optimal transmission and jamming powers are derived, followed by the optimal transmission rate obtained via the bisection method. Numerical and simulation results show that although the warden's location is unknown, by assuming the warden's OPTDP, the proposed OPTDP-JP always satisfies the covertness constraint. In addition, with an uncertain warden and imperfect CSI, the covert throughput provided by OPTDP-JP is 80% higher than that of existing schemes when the covertness constraint is 0.9, showing the effectiveness of OPTDP-JP.
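The MAXDR selection rule described above admits a direct sketch: among candidate jammers, pick the node maximizing the ratio of its distance to the receiver over its distance to the warden, so jamming degrades the warden's detection more than the legitimate link. Coordinates are illustrative; the warden position stands in for its assumed optimal detection position.

```python
import math

# MAXDR sketch: select the jammer that is far from the receiver and
# close to the (assumed) warden position. Positions are illustrative.
def select_jammer(jammers, receiver, warden):
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return max(jammers, key=lambda j: dist(j, receiver) / dist(j, warden))
```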
Funding: Supported by the National Natural Science Foundation of China under Grant U21A20449 and the Zhongguancun Project under Grant 23120035.
Abstract: Synthetic Aperture Radar (SAR) Radio Frequency Identification (RFID) localization is widely used for Automated Guided Vehicles (AGVs) in the Industrial Internet of Things (IIoT). However, AGV speeds are limited by the Phase Difference (PD) between two neighboring readers. In this paper, an Inertial Navigation System (INS)-based SAR RFID localization method (ISRL) is proposed for AGVs moving nonlinearly. To relax the speed limitation, a new phase-unwrapping method based on the similarity of PDs (PU-SPD) is proposed to resolve PD ambiguity when the AGV speed exceeds 60 km/h. For localization, the Gauss-Newton (GN) algorithm is employed, and an initial value estimation scheme based on variable substitution (IVE-VS) is proposed to improve its positioning accuracy and convergence rate; ISRL is thus a combination of IVE-VS and GN. Moreover, the Cramer-Rao Lower Bound (CRLB) and the speed limitation are derived. Simulation results show that ISRL converges after two iterations, and the positioning accuracy reaches 7.50 cm at a phase noise level σ = 0.18, which is 35% better than hyperbolic unbiased estimation localization (HyUnb).
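The PD-ambiguity problem can be illustrated with a standard phase unwrap: measured phases wrap into (-π, π], so ranges recovered from raw phase are ambiguous up to half-wavelength multiples until continuity is restored. This is generic unwrapping, not the paper's PU-SPD (which additionally exploits similarity between neighboring readers' PDs); the wavelength and trajectory are illustrative.

```python
import numpy as np

# Round-trip phase of an RFID backscatter link wraps into (-pi, pi];
# np.unwrap restores a continuous phase track, from which range changes
# can be recovered. Wavelength and trajectory are illustrative values.
wavelength = 0.33                        # ~UHF RFID wavelength in meters
true_range = np.linspace(1.0, 2.0, 50)   # AGV-to-reader range over time
phase = 4 * np.pi * true_range / wavelength     # round-trip phase
wrapped = np.angle(np.exp(1j * phase))          # wrapped into (-pi, pi]
unwrapped = np.unwrap(wrapped)
recovered = unwrapped * wavelength / (4 * np.pi)
```

The recovered track matches the true range up to a constant half-wavelength offset; unwrapping fails once per-sample phase steps exceed π, which is exactly the speed limitation the paper relaxes.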
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62325210 and 62272441); the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB28000000); the National Natural Science Foundation of China (Grant Nos. 62372006, 92365117); and the Fundamental Research Funds for the Central Universities, Peking University.
Abstract: The shadow tomography problem introduced by [1] is an important problem in quantum computing. Given an unknown n-qubit quantum state ρ, the goal is to estimate tr(F_1ρ), ..., tr(F_Mρ) within an additive error ε using as few copies of ρ as possible, where F_1, ..., F_M are known two-outcome measurements. In this paper, we consider the shadow tomography problem with a potentially inaccurate prediction of the true state ρ. This corresponds to practical cases where we possess prior knowledge of the unknown state: in quantum verification or calibration, for example, we may know which state the quantum device is expected to generate, while the state it actually generates may deviate. We introduce an algorithm with sample complexity Õ(n max{…} log²M/ε⁴). In the generic case, even if the prediction is arbitrarily bad, our algorithm has the same complexity as the best algorithm without prediction [2]. At the same time, as the prediction quality improves, the sample complexity decreases smoothly to Õ(n log²M/ε³) when the trace distance between the prediction and the unknown state is O(ε). Furthermore, we conduct numerical experiments to validate our theoretical analysis. The experiments simulate noisy quantum circuits reflecting realistic quantum verification or calibration scenarios. Notably, our algorithm outperforms the previous work without prediction in most settings.
Funding: Supported by the National Natural Science Foundation of China under Grant 62172389.
Abstract: We optimize the parallel threshold ILU algorithm (ParILUT) for GPUs. The optimizations target three building blocks: candidate search and ILU residual computation, adding and removing elements, and threshold selection. First, we fuse candidate search and ILU residual computation by modifying the ParILUT algorithm and extending the register-aware SpGEMM algorithm to compute it; we also develop a GPU bin-search algorithm that makes the register-aware SpGEMM perform better within ParILUT. Second, we adopt a warp-row-parallel approach, instead of a thread-row-parallel one, to add elements to the new L and U and remove elements from the candidates, and we use efficient GPU instructions to locate element positions. Third, we propose a balanced classification tree for threshold selection that balances the buckets' data when many elements share the same value. Finally, we evaluate the performance of each optimization and of the whole ParILUT, and verify the correctness of the optimized version. The results indicate that the optimized ParILUT achieves an average speedup of 4.03× over the original version, and the speedup increases with the amount of fill-in.
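Threshold selection in a threshold-ILU setting amounts to finding the magnitude cutoff that keeps the m largest candidate entries; a selection (partition) finds it without a full sort. This sketch omits the GPU bucket and balanced-tree details the paper optimizes.

```python
import numpy as np

# Find the magnitude cutoff so that `keep` candidate entries survive,
# using an O(n) partition instead of an O(n log n) sort. The GPU
# bucketing and equal-value balancing from the paper are omitted.
def ilu_threshold(values, keep):
    """Return the magnitude cutoff below which entries would be dropped."""
    mags = np.abs(np.asarray(values, dtype=float))
    return np.partition(mags, len(mags) - keep)[len(mags) - keep]
```

The balanced classification tree the paper proposes addresses the degenerate case where many entries tie at this cutoff value and naive bucketing becomes unbalanced.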
Funding: Supported by the Jiangsu Provincial Key Research and Development Program (No. BE20210132), the Zhejiang Provincial Key Research and Development Program (No. 2021C01040), and the S-SET team.
Abstract: Low-Earth-Orbit satellite constellation networks (LEO-SCN) can provide low-cost, large-scale, flexible-coverage wireless communication services. LEO-SCN are characterized by high dynamics and large topology sizes, and protocol development and application testing for them are difficult to carry out in a natural environment, so simulation platforms are a more effective means of technology demonstration. Currently available simulators have limited functionality and simulation scale; a full-featured simulator is lacking. In this paper, we apply parallel discrete-event simulation to LEO-SCN to support large-scale, packet-level simulation of complex systems. To solve the problem that single-process programs cannot cope with complex simulations containing numerous entities, we propose a parallel mechanism and the synchronization algorithms LP-NM and LP-YAWNS. In experiments, we use ns-3 to verify the speedup and efficiency of these algorithms. The results show that the proposed mechanism can provide parallel simulation engine support for LEO-SCN.
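The conservative synchronization that YAWNS-style algorithms rely on can be sketched as a safe-window computation: every logical process may execute all events strictly earlier than the global minimum next-event time plus the lookahead. This is the generic textbook form, not LP-YAWNS itself; event times and lookahead are illustrative.

```python
# Conservative-window sketch (YAWNS style): the bound below which all
# logical processes may safely execute events this round.
def safe_window(next_event_times, lookahead):
    """Exclusive upper bound on timestamps every LP may process now."""
    return min(next_event_times) + lookahead

def executable_events(lp_events, bound):
    """Events of one LP that fall inside the current safe window."""
    return [t for t in lp_events if t < bound]
```

In a satellite network the lookahead can come from minimum propagation delays between nodes; a larger lookahead widens the window and reduces synchronization rounds.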
Abstract: A novel quantum algorithm for the Mastermind game was proposed recently by a research team from Sun Yat-sen University to highlight the power of quantum computing. Mastermind is a popular code-breaking game between a codemaker and a codebreaker. In the commercial version, the codemaker selects a secret sequence of four colored pegs (positions) from six possible colors.
Funding: Supported by the Beijing Science and Technology Planning Project (Grant No. Z231100010323007).
Abstract: With the rapid advancement of artificial intelligence and high-performance computing, heterogeneous computing platforms have evolved to encompass increasingly diverse architectures. While SYCL, an open standard for heterogeneous programming, has gained widespread adoption, its mainstream implementations (such as DPC++ and AdaptiveCpp) primarily target SIMT-architecture devices like GPUs, presenting substantial challenges when adapting to specialized accelerators such as the Cambricon MLU, which employs a fundamentally different SIMD execution model. This cross-programming-model extension encounters two critical challenges: (1) bridging the programming abstraction gap between SIMT's thread-level parallelism and SIMD's data-level parallelism; and (2) harmonizing SYCL's unified memory model with device-specific memory architectures. This paper proposes a novel cross-programming-model SYCL extension methodology to achieve full SYCL support for SIMD architectures, demonstrated through a comprehensive implementation for the Cambricon MLU platform. Our approach introduces MLU-specific vector programming interfaces while maintaining compatibility with the SYCL standard, enabling seamless integration of SIMD-based accelerators into the SYCL ecosystem. To validate our methodology, we integrated the extended SYCL-MLU implementation into PaddlePaddle's CINN compiler, achieving a geometric-mean performance improvement of 9.14% across representative neural networks, including ResNet, YOLOv3, and BERT. This research significantly broadens the application scope of SYCL in heterogeneous programming and provides a systematic methodology for extending SYCL to other SIMD-based hardware platforms.
Funding: Sponsored in part by NKRDP (2021YFB0300800); in part by NSFC (62102396); the Beijing Nova Program (Z211100002121143, 20220484217); the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2021099); and the Pilot for Major Scientific Research Facility of Jiangsu Province of China (No. BM2021800).
Abstract: Despite advancements in computer hardware, the performance of GROMACS simulations has not improved significantly, primarily due to inefficient utilization of substantial hardware resources. Resource utilization in GROMACS simulations can be enhanced through effective resource scheduling when running multiple simulations concurrently on a single computing node, particularly benefiting the frequently employed small-scale system simulations. Previous research focused on co-running multiple GROMACS simulations through time-slice technology. However, this approach introduced notable context-switching overhead and predominantly concentrated on optimizing GPU resource utilization, neglecting the collaborative scheduling of heterogeneous CPU and GPU devices. Nowadays, various GPU vendors have introduced hardware partitioning technologies for spatial resource allocation, complementing traditional time-sharing techniques. Moreover, GROMACS operates as a heterogeneous computing application, alternating computations between CPU and GPU devices; notably, GPU utilization sometimes accounts for as little as 35%. Consequently, a comprehensive approach involving coordinated scheduling of both GPU and CPU is imperative. To leverage the potential of hardware partitioning technologies in alignment with GROMACS' runtime characteristics, we propose FILL, a resource scheduling system designed for co-running multiple GROMACS jobs. FILL employs space-partitioning technology to effectively allocate hardware resources and facilitates collaborative scheduling of CPU and GPU resources, thereby ensuring precise and deterministic allocation of GROMACS job resources. The scheduling aims to improve system throughput while considering simulation turnaround time. Implemented on servers equipped with NVIDIA and AMD GPUs, FILL has shown noteworthy advancements in system throughput. On NVIDIA GPU servers, FILL achieved an improvement of up to 167% over the baseline approach and a boost of 27,928% over state-of-the-art alternatives. Similarly, on AMD GPU servers, FILL demonstrated enhancements of 459% and 24% over the baseline and state-of-the-art methods, respectively. These results validate the effectiveness of FILL in optimizing system throughput for multiple GROMACS simulations.
Funding: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB0500102); the National Science Foundation of China (Nos. 61972416, T2125013, and 92270206); the China National Postdoctoral Program for Innovative Talents (No. BX20240383); the Natural Science Foundation of Shandong Province (No. ZR2022LZH009); GHfound C (No. 202407035455); and the National Key R&D Program of China (No. 2021YFA1000103-3).
Abstract: The Deep Potential (DP) scheme has increased the temporal and spatial scales of molecular dynamics simulations while maintaining ab initio accuracy. DeePMD-kit is an outstanding application that implements the DP scheme efficiently. However, current performance models cannot accurately measure the resource utilization of DeePMD-kit operators or predict execution time. We introduce DP-perf, an interpretable performance model for DeePMD-kit. DP-perf accurately measures the resource utilization of individual DeePMD-kit operators, the communication pattern, and the overall application by exploiting physical system properties and machine configurations. It can be easily applied to mainstream supercomputers, including Tianhe-3F, the new Sunway, Fugaku, and Summit. With DP-perf, users can select the optimal machine and decide the corresponding configuration for various purposes (e.g., lower cost, less time) without real runs. Evaluation on four top supercomputers shows that DP-perf fits overall execution time with a low mean absolute percentage error of 5.7%/8.1%/14.3%/13.1% on Tianhe-3F/new Sunway/Fugaku/Summit. In the prediction scenario, DP-perf predicts total execution time with a mean absolute percentage error of less than 20%.
Funding: Supported by the National Natural Science Foundation of China (NSFC) (Grant Nos. U19A2061, 62272190) and the Sichuan Major R&D Project (Grant No. 22QYCX0168).
Abstract: With the increasing depth of CNNs, their computation and weight-storage requirements expand significantly, preventing wide deployment in resource-constrained scenarios such as embedded systems. To improve the efficiency of deep CNN inference, researchers have explored weight pruning techniques on CNN accelerators (e.g., systolic arrays) to avoid storing and computing unimportant weights. However, these attempts either incur expensive extra hardware costs to encode/decode the irregular sparse weight pattern on accelerators or bring limited performance improvement due to structured pruning's modest compression ratio. To address this challenge, this paper proposes FASS-Pruner, a Fine-grained Accelerator-aware pruning framework via intra-filter Splitting and inter-filter Shuffling: (1) considering the round-by-round execution behavior of CNN accelerators, FASS-Pruner splits filters into multiple rounds to perform column-wise weight pruning; (2) leveraging the calculation independence across filters on CNN accelerators, FASS-Pruner shuffles the filters to prune unimportant row-wise weights. Combining the sparse pattern of the pruned CNN with the dataflow of the systolic array, we modify the systolic-array-based accelerator so that it executes pruned sparse CNNs with better performance and lower energy consumption. By condensing the pruned sparse weights in systolic arrays, FASS-Pruner achieves a comparable pruning ratio while preserving the original dataflow of CNN accelerators, thereby achieving significant performance and energy savings.
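Column-wise pruning of the kind mentioned in (1) can be sketched with a simple L1-importance criterion: drop the columns of a weight matrix with the smallest aggregate magnitude. The ratio and importance metric are assumptions for illustration, not FASS-Pruner's actual policy.

```python
import numpy as np

# Column-wise magnitude pruning sketch: zero out the weakest columns
# of a weight matrix by L1 norm. Ratio and metric are illustrative.
def prune_columns(w, prune_ratio):
    w = w.copy()
    n_prune = int(w.shape[1] * prune_ratio)
    if n_prune == 0:
        return w
    importance = np.abs(w).sum(axis=0)        # L1 norm per column
    drop = np.argsort(importance)[:n_prune]   # weakest columns
    w[:, drop] = 0.0
    return w
```

Zeroed columns form a regular sparse pattern that a systolic array can skip without per-element index decoding, which is the hardware-friendliness argument behind accelerator-aware pruning.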
Funding: Supported by grants from the National Key Research and Development Program of China (Nos. 2024YFA1108302 and 2021YFA1101400); the Strategic Priority Research Program of CAS (No. XDA 0460205); the Open Project of the Key Laboratory of Organ Regeneration and Intelligent Manufacturing (No. 2024KF31); and the Basic Frontier Science Research Program of CAS (No. ZDBS-LY-SM024).
Abstract: The high failure rates in clinical drug development based on animal models highlight the urgent need for more representative human models in biomedical research. In response to this demand, organoids and organ chips have been integrated for greater physiological relevance and dynamic, controlled experimental conditions. This innovative platform, organoids-on-a-chip technology, shows great promise in disease modeling, drug discovery, and personalized medicine, attracting interest from researchers, clinicians, regulatory authorities, and industry stakeholders. This review traces the evolution from organoids to organoids-on-a-chip, driven by the necessity for advanced biological models. We summarize the applications of organoids-on-a-chip in simulating physiological and pathological phenotypes and in therapeutic evaluation, highlighting how integrating organ-chip technologies such as microfluidic systems, mechanical stimulation, and sensor integration optimizes organoid cell types, spatial structure, and physiological functions, thereby expanding their biomedical applications. We conclude by addressing the current challenges in the development of organoids-on-a-chip and offering insights into its prospects. Advances in organoids-on-a-chip are poised to enhance fidelity, standardization, and scalability, and the integration of cutting-edge technologies and interdisciplinary collaboration will be crucial for the technology's progression.
Abstract: A novel quantum search algorithm tailored for continuous optimization and spectral problems was proposed recently by a research team from the University of Electronic Science and Technology of China to broaden the frontiers of quantum computation and enrich its application landscape. Quantum computing has traditionally excelled at tackling discrete search challenges, but many important applications, from large-scale optimization to advanced physics simulations, necessitate searching through continuous domains. These continuous search problems involve uncountably infinite solution spaces and bring computational complexities far beyond those faced in conventional discrete settings. The work, titled "Fixed-Point Quantum Continuous Search Algorithm with Optimal Query Complexity", takes on the core challenge of performing search tasks in domains that may be uncountably infinite, offering theoretical and practical insights into achieving quantum speedups in such settings [1].
Funding: Partially supported by the National Natural Science Foundation of China (Nos. U22A2028, 61925208, 62102399, 62302478, 62302483, 62222214, 62372436, 62302482, and 62302480); the CAS Project for Young Scientists in Basic Research, China (No. YSBR-029); the Youth Innovation Promotion Association CAS; and the Xplore Prize, China.
Abstract: In natural language processing, the rapid development of large language models (LLMs) has attracted increasing attention. LLMs have shown a high level of creativity in various tasks, but methods for assessing that creativity are inadequate: assessment must account for differences from humans and measure multiple dimensions while balancing accuracy and efficiency. This paper aims to establish an efficient framework for assessing the creativity of LLMs. By adapting the modified Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks, emphasizing 4 criteria: fluency, flexibility, originality, and elaboration. We develop a comprehensive dataset of 700 test questions and an LLM-based evaluation method, and present a novel analysis of LLMs' responses to diverse prompts and role-play situations. We find that LLM creativity primarily falls short in originality while excelling in elaboration, and that prompts and the model's role-play settings significantly influence creativity. The experimental results also indicate that collaboration among multiple LLMs can enhance originality. Notably, our findings reveal a consensus between human evaluations and LLMs regarding the personality traits that influence creativity. These findings underscore the significant impact of LLM design on creativity, bridge artificial intelligence and human creativity, and offer insights into LLMs' creativity and potential applications.
Funding: Supported in part by the National Natural Science Foundation of China (Grant Nos. 62325210 and 62272441).
Abstract: Submodular maximization is a significant area of interest in combinatorial optimization, with various real-world applications. In recent years, streaming algorithms for submodular maximization have gained attention, allowing real-time processing of large data sets by examining each piece of data only once. However, most state-of-the-art algorithms apply only to monotone submodular maximization, and significant gaps remain between the approximation ratios for monotone and non-monotone objective functions. In this paper, we propose a streaming algorithm framework for non-monotone submodular maximization and use it to design deterministic streaming algorithms for the d-knapsack and knapsack constraints. Our 1-pass streaming algorithm for the d-knapsack constraint has a 1/(4(d+1)) − ε approximation ratio, uses O(B log B/ε) memory, and takes O(log B/ε) query time per element, where B = min(n, b) is the maximum number of elements the knapsack can store. As a special case of the d-knapsack constraint, we obtain a 1-pass streaming algorithm with a 1/8 − ε approximation ratio for the knapsack constraint. To our knowledge, no previous streaming algorithm handles this constraint with a non-monotone objective function, even when d = 1. In addition, we propose a multi-pass streaming algorithm with a 1/6 − ε approximation that stores O(B) elements.
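The generic density-threshold idea behind such streaming algorithms can be sketched as follows: keep an element when its marginal-gain-to-cost density clears a threshold and it still fits the budget. This is the textbook single-pass rule, not the paper's 1/8 − ε algorithm; the set function and threshold in the usage are illustrative.

```python
# Single-pass streaming sketch for submodular maximization under a
# knapsack constraint, using a marginal-gain density threshold.
def stream_knapsack(stream, f, budget, threshold):
    """stream: iterable of (element, cost); f: set function on frozensets.
    Keeps e when it fits the budget and gain(e)/cost(e) >= threshold."""
    chosen, used = set(), 0.0
    for e, cost in stream:
        gain = f(frozenset(chosen | {e})) - f(frozenset(chosen))
        if used + cost <= budget and gain / cost >= threshold:
            chosen.add(e)
            used += cost
    return chosen
```

With a coverage function, a redundant element (no marginal gain) is rejected even if it fits the budget, which is the one-look-per-element behavior the abstract describes; the theoretical work lies in choosing thresholds that certify an approximation ratio.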