Abstract: With the continuous development of artificial intelligence technology, large-scale language models have demonstrated significant potential across various fields. In education, an increasing number of methods leverage large-scale language models to enhance educational quality, introducing new ideas and opportunities for reform. However, training a large language model with substantial professional knowledge to meet teaching needs incurs high labor costs. The fine-tuning approach based on human feedback alignment can significantly lower these labor costs. Consequently, this article thoroughly investigates the application of this large language model method, rooted in human feedback alignment, within the educational reform of algorithm analysis and design courses, and examines its impact on teaching effectiveness and students' learning experiences.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62276124 and 62272210), the Fundamental Research Funds for the Central Universities (14380020), the National Natural Science Foundation of China for Ph.D. Students (624B2069), and the Young Elite Scientists Sponsorship Program by CAST for Ph.D. Students.
Abstract: Diversity plays a significant role in many problems, such as ensemble learning, reinforcement learning, and combinatorial optimization. How to define the diversity measure is a longstanding problem. Many methods rely on expert experience to define a proper behavior space and then obtain the diversity measure, which is, however, challenging in many scenarios. In this paper, we propose the problem of learning a behavior space from human feedback and present a general method called Diversity from Human Feedback (DivHF) to solve it. DivHF learns a behavior descriptor consistent with human preference by querying human feedback. The learned behavior descriptor can be combined with any distance measure to define a diversity measure. We demonstrate the effectiveness of DivHF by integrating it with the Quality-Diversity optimization algorithm MAP-Elites and conducting experiments on the QDax suite. The results show that the behavior learned by DivHF is much more consistent with human requirements than the one learned by direct data-driven approaches without human feedback, and makes the final solutions more diverse under human preference. Our contributions include formulating the problem, proposing the DivHF method, and demonstrating its effectiveness through experiments.
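As an illustration of the preference-learning step described above, the following Python sketch trains a behavior descriptor from pairwise human judgments. The network shape, the Bradley-Terry-style loss, and the query format (a human indicating which of two trajectories differs more from a reference) are illustrative assumptions, not the authors' exact DivHF formulation.

# A minimal sketch of preference-based behavior-descriptor learning in the
# spirit of DivHF. Names, dimensions, and the loss form are assumptions.
import torch
import torch.nn as nn

class BehaviorDescriptor(nn.Module):
    """Maps a raw trajectory encoding to a low-dimensional behavior space."""
    def __init__(self, traj_dim: int, desc_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, 64), nn.ReLU(), nn.Linear(64, desc_dim)
        )

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        return self.net(traj)

def preference_loss(model, traj_a, traj_b, traj_ref, human_prefers_a):
    """Bradley-Terry-style loss: the human says which of (a, b) differs
    more from a reference trajectory; the descriptor should agree."""
    d_a = torch.norm(model(traj_a) - model(traj_ref), dim=-1)
    d_b = torch.norm(model(traj_b) - model(traj_ref), dim=-1)
    logits = d_a - d_b  # larger => a looks more diverse than b
    return nn.functional.binary_cross_entropy_with_logits(
        logits, human_prefers_a.float()
    )

# One illustrative update on random stand-in data (real queries would
# come from humans comparing actual trajectories).
model = BehaviorDescriptor(traj_dim=128)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
a, b, ref = (torch.randn(8, 128) for _ in range(3))
prefers_a = torch.randint(0, 2, (8,))
loss = preference_loss(model, a, b, ref, prefers_a)
opt.zero_grad(); loss.backward(); opt.step()

In a Quality-Diversity loop such as MAP-Elites, the trained descriptor's output would then be combined with a distance measure to define the diversity archive.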
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62088101.
Abstract: Artificial intelligence empowers the rapid development of autonomous intelligent systems (AISs), but it still struggles to cope with open, complex, dynamic, and uncertain environments, limiting its large-scale industrial application. Reliable human feedback provides a mechanism for aligning machine behavior with human values and holds promise as a new paradigm for the evolution and enhancement of machine intelligence. This paper analyzes the engineering insights from ChatGPT and elaborates on the evolution from traditional feedback to human feedback. Then, a unified framework for self-evolving intelligent driving (ID) based on human feedback is proposed. Finally, an application in the congested ramp scenario illustrates the effectiveness of the proposed framework.
Funding: Supported by funding from the National Natural Science Foundation of China (Grant Nos. 62325602, 62406292, 62302459, 62406293, and 62036010).
Abstract: The efficiency of carrier-based aircraft support operation scheduling critically impacts aircraft carrier operational effectiveness by determining sortie generation rates, yet it faces significant challenges in complex deck environments characterized by resource coupling, dynamic constraints, and high-dimensional state-action spaces. Traditional optimization algorithms and vanilla reinforcement learning (RL) struggle with computational inefficiency, sparse rewards, and limited adaptability to dynamic scenarios, while human expert systems are constrained by the quality of expert knowledge, and poor expert guidance may even have a negative impact. To address these limitations, this paper proposes a human experience-guided actor-critic reinforcement learning framework that synergizes domain expertise with adaptive learning. First, a dynamic Markov decision process (MDP) model is developed to rigorously simulate carrier deck operations, explicitly encoding constraints on positions, resources, and collision avoidance. Building upon this foundation, a human experience database is constructed to enable real-time, pattern-matching-based intervention during agent-environment interactions, dynamically correcting wrong actions to avoid catastrophic states while refining exploration efficiency. Finally, the policy and value network objectives are reshaped to incorporate human intent through hybrid reward functions and adaptive guidance weighting, ensuring balanced integration of expert knowledge with RL's exploration capabilities. Extensive simulations across three scenarios demonstrate superior performance compared to state-of-the-art methods and maintain robustness under suboptimal human guidance. These results validate the framework's ability to harmonize human expertise with adaptive learning, offering a practical solution for real-world carriers.
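To make the intervention-and-hybrid-reward mechanism concrete, here is a minimal Python sketch in which a human experience database can override catastrophic actions and a guidance bonus is blended into the environment reward with an annealed weight. The ExperienceDatabase interface, the env/agent methods, and the linear annealing schedule are hypothetical stand-ins for the paper's components, not its actual implementation.

# A sketch of expert intervention plus a hybrid, annealed reward.
# env, agent, and the rule predicates are assumed interfaces.
class ExperienceDatabase:
    """Maps recognizable state patterns to expert-recommended actions."""
    def __init__(self, rules):
        self.rules = rules  # list of (predicate, action) pairs

    def match(self, state):
        for predicate, action in self.rules:
            if predicate(state):
                return action
        return None

def guided_step(env, agent, db, step, total_steps):
    state = env.observe()
    action = agent.act(state)
    expert_action = db.match(state)
    # Intervention: replace a catastrophic action with the expert's choice.
    if expert_action is not None and env.is_catastrophic(state, action):
        action = expert_action
    next_state, env_reward, done = env.step(action)
    # Hybrid reward: guidance bonus for agreeing with the expert, with a
    # weight that anneals toward pure environment reward over training.
    w = max(0.0, 1.0 - step / total_steps)
    guidance = 1.0 if action == expert_action else 0.0
    reward = env_reward + w * guidance
    return next_state, reward, done

The annealing term is one simple way to keep expert influence strong early (when exploration is error-prone) while letting the learned policy dominate later, which also limits the damage from suboptimal guidance.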
Abstract: The two-way feedback of the human body was first published in 1992. The sensation of two-way feedback in the body is a special system of human reaction that maintains and regulates the symmetry and balance of the human body, and this two-way feedback reflects human health. For overall human health and the delay of aging, it is necessary to pay attention to stimulations (passive acceptance and active interventions), their relevant influences in the human body, and the stimulative effect. In this paper, experimental research on stimulation and an example of two-way feedback in the human body are given, laying a foundation for the prevention, medical treatment, and hygiene of overall human health.
Abstract: With the growth of the elderly population and rising health care costs, the role of service robots in aiding the disabled and the elderly is becoming important. Many researchers around the world have paid much attention to healthcare robots and rehabilitation robots. To achieve natural and harmonious communication between the user and a service robot, the information perception/feedback ability and the interaction ability of service robots are becoming more important among many key issues.
Abstract: Existing methods for tracing water pollution sources typically integrate three-dimensional excitation-emission matrix (3D-EEM) fluorescence spectroscopy with similarity-based matching algorithms. However, these approaches exhibit high error rates in borderline cases and necessitate expert manual review, which limits scalability and introduces inconsistencies between algorithmic outputs and expert judgment. To address these limitations, we propose a large vision-language model (VLM) designed as an "expert agent" to automatically refine similarity scores, ensuring alignment with expert decisions and overcoming key application bottlenecks. The model consists of two core components: (1) a rule-based similarity calculation module that generates initial spectral similarity scores, and (2) a pre-trained large vision-language model fine-tuned via supervised learning and reinforcement learning with human feedback (RLHF) to emulate expert assessments. To facilitate training and evaluation, we introduce two expert-annotated datasets, Spec1k and SpecReason, which capture both quantitative corrections and qualitative reasoning patterns, allowing the model to emulate expert decision-making processes. Experimental results demonstrate that our method achieves 81.45% source attribution accuracy, 38.24% higher than rule-based and machine learning baselines. Real-world deployment further validates its effectiveness.
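The two-component design above can be sketched as a scoring pipeline: a rule-based module produces initial similarity scores, and a second model refines them before attribution. In the following Python sketch, the cosine-similarity rule and the refine callable (standing in for the fine-tuned VLM "expert agent") are illustrative assumptions, not the paper's actual modules.

# A sketch of the two-stage attribution pipeline over EEM spectra.
import numpy as np

def rule_based_similarity(eem_sample: np.ndarray, eem_source: np.ndarray) -> float:
    """Initial score: cosine similarity between flattened EEM matrices."""
    a, b = eem_sample.ravel(), eem_source.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def attribute_source(eem_sample, source_library, refine):
    """Score every candidate source, let the expert-agent model adjust the
    scores (especially borderline ones), then pick the best match."""
    scores = {name: rule_based_similarity(eem_sample, eem)
              for name, eem in source_library.items()}
    refined = {name: refine(eem_sample, source_library[name], s)
               for name, s in scores.items()}
    return max(refined, key=refined.get)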
Abstract: Reinforcement Learning from Human Feedback (RLHF) integrates human wisdom with the power of machines. Through the feedback evaluations or suggestions that human trainers provide on the behavior or outputs of an artificial intelligence system, it creates reward signals or changes the agent's policy. High-quality human feedback can significantly improve an AI system's ability to understand and adapt to human preferences and values; however, the scarcity of high-quality data has become a bottleneck for the further development of RLHF. Recently, the rise of Reinforcement Learning from AI Feedback (RLAIF) has provided a new perspective for overcoming this limitation, prompting this paper to revisit and define a broader framework: Reinforcement Learning from X-Feedback (RLXF). RLXF is a framework that combines multiple feedback sources (including humans and AI) to guide the reinforcement learning process. The feedback can take many forms, such as direct reward signals, policy suggestions, and preference rankings, with the aim of optimizing the agent's behavior policy to better adapt to complex and changing environments and to satisfy diverse objectives. Around RLXF, this paper offers a systematic discussion from methodological innovation to frontier applications: first, it establishes a unified theoretical framework for RLXF and clarifies the core mechanism by which multi-source feedback drives policy optimization; second, it classifies existing research into three feedback paradigms: imitation learning, reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF). It then discusses in detail breakthrough application examples of RLXF in key domains such as autonomous driving, embodied intelligence, and large language models (LLMs). Finally, it summarizes the main challenges currently facing RLXF and offers an outlook on future research directions.
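A minimal Python sketch of the RLXF idea follows: heterogeneous feedback sources (here, an AI judge and a hand-written safety rule) are aggregated into a single training reward. The source functions, weights, and veto-style rule are illustrative assumptions rather than a prescribed RLXF instantiation.

# Aggregating multiple feedback sources into one reward signal.
from typing import Callable, List, Tuple

FeedbackSource = Callable[[str, str], float]  # (prompt, response) -> score

def rlxf_reward(prompt: str, response: str,
                sources: List[Tuple[FeedbackSource, float]]) -> float:
    """Weighted combination of heterogeneous feedback scores."""
    return sum(w * src(prompt, response) for src, w in sources)

# Example wiring: an AI judge provides a scalar score, while a hard-coded
# safety rule can effectively veto an unsafe response.
def ai_judge(prompt, response):
    return 0.7  # placeholder for a reward model's output

def safety_rule(prompt, response):
    return -10.0 if "unsafe" in response else 0.0

reward = rlxf_reward("hello", "hi there",
                     [(ai_judge, 1.0), (safety_rule, 1.0)])

Other feedback forms mentioned above, such as policy suggestions or preference rankings, would plug in through different interfaces (e.g., action overrides or pairwise comparison losses) rather than scalar scores.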
Funding: ZTE Industry-University-Institute Cooperation Funds under Grant No. 20200492.
Abstract: One particular challenge for large-scale software systems is anomaly detection. System logs are a straightforward and common source of information for anomaly detection. Existing log-based anomaly detectors are unusable in real-world industrial systems due to high false-positive rates. In this paper, we incorporate human feedback to adjust the detection model structure to reduce false positives. We apply our approach to two industrial large-scale systems. Results have shown that our approach performs much better than state-of-the-art works, with 50% higher accuracy. Besides, human feedback can reduce more than 70% of false positives and greatly improve detection precision.
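One simple way to realize such a feedback loop is to let analyst verdicts on raised alerts suppress confirmed false-positive log patterns in future detections. The whitelist mechanism in this Python sketch is an illustrative assumption about how human feedback could steer a detector; the paper's actual approach adjusts the detection model structure itself.

# A feedback-aware wrapper around any log-based anomaly detector.
class FeedbackAwareDetector:
    def __init__(self, base_detector):
        self.base = base_detector   # assumed to expose is_anomalous()
        self.suppressed = set()     # patterns analysts marked benign

    def detect(self, log_template: str) -> bool:
        if log_template in self.suppressed:
            return False            # known false positive: stay silent
        return self.base.is_anomalous(log_template)

    def feedback(self, log_template: str, analyst_says_anomaly: bool):
        """Record an analyst's review of a raised alert; benign verdicts
        suppress the pattern in future detections."""
        if not analyst_says_anomaly:
            self.suppressed.add(log_template)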
Abstract: Reinforcement learning with human feedback (RLHF) is the mainstream method for aligning today's large language models (LLMs), yet its core optimization algorithm, proximal policy optimization (PPO), suffers from significant efficiency problems. PPO consists of three interrelated stages (generation, inference, and training), each with different computational characteristics. However, existing RLHF parallel frameworks execute all PPO stages sequentially under the same parallel strategy, which leads to two problems: first, the generation stage cannot fully utilize computing resources, hurting overall efficiency; second, the strictly serial execution between stages leaves potential parallelism unexploited. To address these problems, this paper proposes a novel RLHF parallel framework, Pipe-RLHF. The framework adaptively determines the optimal parallel strategy for each stage according to its computational characteristics, breaks the existing serial-stage paradigm, and adopts an asynchronous PPO algorithm to exploit inter-stage parallelism. Specifically, it innovatively proposes a delayed inter-batch pipeline parallel method for PPO's generation stage, significantly improving the utilization of computing resources in that stage; it then uses asynchronous PPO to relax the dependencies between stages, applying inter-stage parallelism to accelerate PPO; finally, for the overall optimization of the PPO algorithm, it constructs a hierarchical parallel-strategy space and proposes an optimization algorithm to search for the optimal solution in that space. Performance evaluations on multiple large language models show that Pipe-RLHF achieves up to a 3.7x speedup over existing methods, fully validating the framework's effectiveness and superiority.
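The inter-stage parallelism idea can be illustrated with a toy Python pipeline in which generation for batch k+1 overlaps with inference and training on batch k, so the actor's weights lag by one step as in asynchronous PPO. The stage functions are placeholders; this sketch mirrors the delayed inter-batch pipelining only in spirit, not Pipe-RLHF's actual implementation.

# Overlapping PPO's generation stage with inference and training.
from concurrent.futures import ThreadPoolExecutor

def generate(batch_id):   # rollout generation (placeholder)
    return f"rollouts-{batch_id}"

def infer(rollouts):      # reward / value inference (placeholder)
    return f"scored-{rollouts}"

def train(scored):        # PPO update (placeholder)
    print("updated on", scored)

def pipelined_ppo(num_batches: int):
    with ThreadPoolExecutor(max_workers=2) as pool:
        next_gen = pool.submit(generate, 0)
        for k in range(num_batches):
            rollouts = next_gen.result()
            if k + 1 < num_batches:
                # Start generating the next batch while we score and
                # train on the current one (weights lag by one step).
                next_gen = pool.submit(generate, k + 1)
            train(infer(rollouts))

pipelined_ppo(3)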