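Traditional cloud-platform operations monitoring systems suffer from scattered monitoring information, redundant invalid alerts, high false-alarm rates, and poor support for global operations decisions. After analyzing the shortcomings of existing monitoring solutions based on Round Robin Database (RRD), Zabbix, and similar technologies, this paper briefly reviews the technical suitability of Prometheus and Grafana and proposes a private cloud monitoring system built on this combination, together with its implementation method. The system relies on four cooperating modules: data collection, data storage, monitoring display, and alert execution. The data collection module uses a dual-track strategy of interfaces and probes, combined with a cross-network interaction scheme and a mechanism for overcoming interface rate limiting, to collect monitoring data comprehensively across multiple network environments. The data storage module builds a three-layer architecture of logical organization, sharded storage, and federated aggregation on top of the Prometheus time-series database and a label extension model, enabling efficient storage and querying of multi-source heterogeneous data. The alert execution module introduces a dynamic threshold algorithm, a tiered suppression strategy, and an alert storm handling mechanism to improve alert accuracy and controllability. A private cloud test cluster of 3 physical servers was built, and, with a Nagios system as the control, the system's performance was tested in simulation and compared under normal load, resource overload, network isolation, and other scenarios. The experimental results show that, compared with the traditional solution, the system reduces cumulative invalid alerts over 72 h by 70.9%, raises alert accuracy to 92.2% (72.7% higher than the control group), lowers average alert latency by 57.1%, and reduces CPU and memory usage by 6.8% and 0.9 GB, respectively. The study concludes that the system effectively overcomes the defects of traditional monitoring setups, significantly improves the operational stability and operations efficiency of private cloud platforms, and has strong value for adoption in engineering practice.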
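The abstract names a dynamic threshold algorithm but does not describe it. As a rough illustration of the general idea only, the sketch below flags a sample when it exceeds the rolling mean of recent samples by `k` standard deviations. The class name `DynamicThreshold`, the window size, and the factor `k` are all assumptions made for this example and are not taken from the paper.

```python
from collections import deque
from statistics import mean, pstdev


class DynamicThreshold:
    """Hypothetical rolling mean + k*sigma detector (not the paper's algorithm)."""

    def __init__(self, window=5, k=3.0):
        self.samples = deque(maxlen=window)  # most recent samples only
        self.k = k

    def is_anomalous(self, value):
        # Until the window is full there is no baseline, so never alert.
        if len(self.samples) < self.samples.maxlen:
            self.samples.append(value)
            return False
        # The threshold adapts to the recent history instead of being fixed.
        threshold = mean(self.samples) + self.k * pstdev(self.samples)
        anomalous = value > threshold
        self.samples.append(value)
        return anomalous


detector = DynamicThreshold(window=5, k=3.0)
cpu_usage = [50, 52, 49, 51, 50, 95, 51]  # made-up CPU percentages
flags = [detector.is_anomalous(v) for v in cpu_usage]
print(flags)
```

Unlike a fixed threshold, the limit here moves with the recent baseline, which is one common way to cut down on alerts caused by normal load drift.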
As an important complement to cloud computing, edge computing can effectively reduce the workload of the backbone network. To reduce latency and energy consumption of edge computing, deep learning is used to learn the task offloading strategies by interacting with the entities. In actual application scenarios, users of edge computing are always changing dynamically. However, the existing task offloading strategies cannot be applied to such dynamic scenarios. To solve this problem, we propose a novel dynamic task offloading framework for distributed edge computing, leveraging the potential of meta-reinforcement learning (MRL). Our approach formulates a multi-objective optimization problem aimed at minimizing both delay and energy consumption. We model the task offloading strategy using a directed acyclic graph (DAG). Furthermore, we propose a distributed edge computing adaptive task offloading algorithm rooted in MRL. This algorithm integrates multiple Markov decision processes (MDP) with a sequence-to-sequence (seq2seq) network, enabling it to learn and adapt task offloading strategies responsively across diverse network environments. To achieve joint optimization of delay and energy consumption, we incorporate the non-dominated sorting genetic algorithm II (NSGA-II) into our framework. Simulation results demonstrate the superiority of our proposed solution, achieving a 21% reduction in time delay and a 19% decrease in energy consumption compared to alternative task offloading schemes. Moreover, our scheme exhibits remarkable adaptability, responding swiftly to changes in various network environments.
Funding: Funded by the Fundamental Research Funds for the Central Universities (J2023-024, J2023-027).
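The edge-computing abstract above relies on NSGA-II for the joint optimization of delay and energy. The core notion NSGA-II builds on is Pareto dominance, which the minimal sketch below illustrates for two minimization objectives. It shows only the first non-dominated front, not the full NSGA-II procedure; the function names and the candidate (delay, energy) values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up (delay, energy) pairs for four candidate offloading plans.
candidates = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (11.0, 6.0)]
front = pareto_front(candidates)
print(front)  # (11.0, 6.0) is dropped because (10.0, 5.0) beats it on both
```

Each plan left on the front trades delay against energy differently, so choosing among them requires an extra preference and cannot be settled by a single score.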