Abstract: Cloud service providers (CSPs) distribute data centers worldwide to save energy costs through efficient workload allocation strategies. Many CSPs are challenged by the significant rise in user demands because of the extensive energy consumed during workload processing. Numerous studies have examined operating-cost mitigation techniques for geo-distributed data centers (DCs). However, operating-cost savings during workload processing that also exploit string-matching techniques in geo-distributed DCs remain unexplored. In this research, we propose a novel string-matching-based geographical load balancing (SMGLB) technique to mitigate the operating cost of geo-distributed DCs. The primary goal of this study is to use a string-matching algorithm (i.e., Boyer-Moore) to compare the contents of incoming workloads with those of documents that have already been processed in a data center. On a successful match, the global load balancer refrains from sending the user's request to a data center for processing and instead returns the results of the previously processed workload to the user, saving energy. If no match is found, the global load balancer allocates the incoming workload to a specific DC for processing, considering variable energy prices, the number of active servers, on-site green energy, and traces of incoming workload. The results of numerical evaluations show that SMGLB minimizes the operating expenses of geo-distributed data centers better than existing workload distribution techniques.
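The matching step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Boyer-Moore search uses only the bad-character heuristic, and the `cache` contents and `handle_workload` helper are hypothetical stand-ins for the global load balancer's lookup of previously processed documents.

```python
def boyer_moore_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text,
    or -1 if absent, using the bad-character heuristic."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Last index of each character in the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    s = 0  # shift of the pattern relative to the text
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            return s  # full match found
        # Shift so the mismatched text character aligns with its last
        # occurrence in the pattern (or skip past it entirely).
        s += max(1, j - last.get(text[s + j], -1))
    return -1

# Hypothetical cache of previously processed documents and their results.
cache = {"weather report 2024": "cached-result"}

def handle_workload(payload: str):
    for doc, result in cache.items():
        if boyer_moore_search(payload, doc) != -1:
            return result  # serve cached output; no DC dispatch needed
    return None  # fall through to geographical load balancing
```

On a cache hit the request never reaches a data center; on a miss the balancer would proceed with its price- and energy-aware placement.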
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. U23B2004 and 62162018, the Guangxi Natural Science Foundation under Grant Nos. 2025GXNSFBA069389 and 2025GXNSFBA069101, the Key Research and Development Program of Guangxi under Grant No. AB25069157, the Hunan Provincial Excellent Youth Fund under Grant No. 2023JJ20055, and the Open Project Program of Guangxi Key Laboratory of Digital Infrastructure under Grant No. GXDIOP2024005.
Abstract: The number and scale of data centers worldwide are growing rapidly in the era of big data, leading to massive energy consumption and formidable carbon emissions. To achieve the efficient and sustainable development of the information technology (IT) industry, researchers propose scheduling data or data analytics jobs to data centers with low electricity prices and carbon emission rates. However, because geo-distributed data centers are highly heterogeneous and dynamic in resource capacity, electricity price, and carbon emission rate, it is quite difficult to optimize their electricity cost and carbon emissions over a long period. In this paper, we propose an energy-aware data backup and job scheduling method with minimal cost (EDJC) to minimize the electricity cost of geo-distributed data analytics jobs while ensuring the long-term carbon emission budget of each data center. Specifically, we first design a cost-effective data backup algorithm that generates a cost-minimizing backup strategy based on historical job requirements. Then, building on the backup strategy, we apply an online carbon-aware job scheduling algorithm to compute the job scheduling strategy in each time slot. This algorithm uses Lyapunov optimization to decompose the long-term scheduling optimization problem into a series of real-time scheduling subproblems, thereby minimizing the electricity cost while satisfying the carbon emission budget. Experimental results show that EDJC significantly reduces the total electricity cost of the data centers while meeting their carbon emission constraints.
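The per-slot decomposition can be illustrated with a drift-plus-penalty rule of the kind Lyapunov optimization yields: each data center keeps a virtual "carbon-deficit" queue, and each slot's job is placed to minimize a weighted sum of electricity cost and queue-scaled carbon emission. This is a heavily simplified sketch (one job per slot, linear costs); all names, the parameter `V`, and the queue model are illustrative, not EDJC's exact formulation.

```python
def schedule_slot(job_load, prices, carbon_rates, deficit_queues,
                  budgets_per_slot, V=10.0):
    """Pick the data center minimizing V*cost + Q_i*carbon for one job,
    then update each DC's carbon-deficit virtual queue Q_i."""
    scores = [V * prices[i] * job_load
              + deficit_queues[i] * carbon_rates[i] * job_load
              for i in range(len(prices))]
    chosen = min(range(len(prices)), key=scores.__getitem__)
    # A queue grows when a DC emits beyond its per-slot budget share,
    # pushing future jobs away from over-budget DCs.
    for i in range(len(deficit_queues)):
        emitted = carbon_rates[i] * job_load if i == chosen else 0.0
        deficit_queues[i] = max(0.0, deficit_queues[i]
                                + emitted - budgets_per_slot[i])
    return chosen, deficit_queues
```

With empty queues the cheapest DC wins; once a DC's deficit queue is large, the rule diverts jobs to greener DCs even at higher electricity cost, which is how the long-term budget is enforced without solving the whole horizon at once.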
Funding: Supported by the National Natural Science Foundation of China (Nos. 61320106007, 61572129, 61502097, and 61370207), the National High-Tech Research and Development (863) Program of China (No. 2013AA013503), the International S&T Cooperation Program of China (No. 2015DFA10490), the Jiangsu Research Prospective Joint Research Project (No. BY2013073-01), the Jiangsu Provincial Key Laboratory of Network and Information Security (No. BM2003201), and the Key Laboratory of Computer Network and Information Integration of Ministry of Education of China (No. 93K-9); also supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization and the Collaborative Innovation Center of Wireless Communications Technology.
Abstract: Recent developments in cloud computing and big data have spurred the emergence of data-intensive applications in which massive scientific datasets are stored in globally distributed scientific data centers and accessed frequently by scientists worldwide. Multiple associated data items distributed across different scientific data centers may be requested by a single data processing task, and data placement decisions must respect the storage capacity limits of those data centers. Optimizing data access cost when placing data items in globally distributed scientific data centers has therefore become an increasingly important goal. Existing placement approaches for geo-distributed data items are insufficient: they either cannot cope with the cost incurred by associated data access, or they overlook storage capacity limits, a very practical constraint of scientific data centers. In this paper, inspired by applications in high-energy physics, we propose an integer-programming-based data placement model that addresses the above challenges as a Non-deterministic Polynomial-time (NP)-hard problem. In addition, we use a Lagrangian-relaxation-based heuristic algorithm to obtain high-quality data placement solutions. Our simulation results demonstrate that our algorithm is effective and significantly reduces overall data access cost.
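The Lagrangian-relaxation idea can be sketched on a toy version of the placement problem: relax the storage-capacity constraints with multipliers, let the relaxed problem decompose per data item, and update the multipliers by a subgradient step. This is a generic textbook sketch under assumed cost/size inputs, not the paper's model or algorithm.

```python
def lagrangian_placement(access_cost, sizes, capacity, iters=50, step=1.0):
    """Toy placement: assign each item to one data center to minimize
    access cost, with capacity constraints relaxed via multipliers.
    access_cost[i][j] = cost of serving item i from DC j."""
    n_items, n_dcs = len(access_cost), len(capacity)
    lam = [0.0] * n_dcs  # multipliers for the relaxed capacity constraints
    best = None
    for _ in range(iters):
        # The relaxed subproblem decomposes per item: pick the DC with
        # the smallest penalized cost.
        place = [min(range(n_dcs),
                     key=lambda j: access_cost[i][j] + lam[j] * sizes[i])
                 for i in range(n_items)]
        used = [0.0] * n_dcs
        for i, j in enumerate(place):
            used[j] += sizes[i]
        # Keep the best feasible placement seen so far.
        if all(used[j] <= capacity[j] for j in range(n_dcs)):
            cost = sum(access_cost[i][place[i]] for i in range(n_items))
            if best is None or cost < best[0]:
                best = (cost, place)
        # Subgradient step: raise a multiplier where capacity is violated.
        lam = [max(0.0, lam[j] + step * (used[j] - capacity[j]))
               for j in range(n_dcs)]
    return best
```

The multipliers act as congestion prices on over-full data centers, steering subsequent per-item choices toward feasible, low-cost placements.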
Abstract: As a typical erasure coding choice, Reed-Solomon (RS) codes have such high repair cost that there is a penalty for high reliability and storage efficiency, making them unsuitable for geo-distributed storage systems. We present a novel family of concurrent regeneration codes with local reconstruction (CRL) in this paper. The CRL codes enjoy three benefits. First, they minimize the network bandwidth for node repair. Second, they reduce the number of accessed nodes by calculating parities from a subset of data chunks and using an implied parity chunk. Third, they reconstruct faster than existing erasure codes in geo-distributed storage systems. In addition, we demonstrate how the CRL codes overcome the limitations of Reed-Solomon codes, and we illustrate analytically that they achieve an excellent trade-off between chunk locality and minimum distance. Furthermore, we present theoretical analysis of the CRL codes, including latency analysis and reliability analysis. Through quantitative comparisons, we show that in terms of data reconstruction time, CRL(6, 2, 2) takes only 0.657x that of Azure LRC(6, 2, 2), where there are six data chunks, two global parities, and two local parities, and CRL(10, 4, 2) takes only 0.656x that of HDFS-Xorbas(10, 4, 2), where there are 10 data chunks, four local parities, and two global parities. Our experimental results show the performance of CRL in two kinds of environments: 1) its encoding and decoding throughputs in memory are at least 57.25% and 66.85% higher than those of its competitors, and 2) it achieves at least 1.46x and 1.21x higher encoding and decoding throughputs than its competitors in JBOD (Just a Bunch Of Disks). We also show that CRL exceeds LRC by 28.79% and 30.19% in encoding and decoding throughputs in a geo-distributed environment.
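The locality benefit of local parities can be illustrated with plain XOR groups: repairing one lost data chunk reads only its local group rather than all chunks, as a classical RS repair would. This is a minimal sketch of local reconstruction in general, assuming XOR parities over two groups of three chunks; it is not the CRL construction itself.

```python
def xor_chunks(chunks):
    """XOR equal-length byte chunks together."""
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

# Six data chunks in two local groups of three, each with an XOR parity.
data = [bytes([v] * 4) for v in range(1, 7)]
group_a, group_b = data[:3], data[3:]
parity_a = xor_chunks(group_a)
parity_b = xor_chunks(group_b)

# Repairing a lost chunk (here data[1]) touches only its local group:
# two surviving data chunks plus one local parity, i.e. three reads,
# instead of reading across all six chunks.
repaired = xor_chunks([group_a[0], group_a[2], parity_a])
```

In a geo-distributed deployment the saved reads are cross-site transfers, which is why cutting the number of accessed nodes translates directly into lower repair bandwidth and latency.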
Funding: Supported by the National Key R&D Program of China (No. 2022ZD0115304), the National Natural Science Foundation of China Young Scientists Fund (No. 62402266), the National Natural Science Foundation of China for Distinguished Young Scholars (No. 62225206), and the CCF-Ant Group Research Fund CCF-AFSG (No. RF20240501).
Abstract: As the computational demands driven by large-model technologies continue to grow rapidly, leveraging GPU hardware to expedite parallel training has become a common strategy. When the computational resources within a single cluster are insufficient for large-model training, the hybrid use of heterogeneous acceleration hardware has emerged as a promising technical solution, and the scheduling of such diverse cloud resources has attracted considerable interest. However, these computing resources are often geographically distributed. Lacking awareness of heterogeneous devices and network topologies, existing parallel training frameworks struggle to effectively leverage mixed GPU resources across constrained networks. To boost the computing capability of connected heterogeneous clusters, we propose HGTrainer, an optimizer that plans heterogeneous parallel strategies across distributed clusters for large-model training. HGTrainer can adaptively saturate heterogeneous clusters thanks to an expanded tunable parallelism space for heterogeneous accelerators and its awareness of the relatively lower inter-cluster bandwidth. To achieve this goal, we formulate the model partitioning problem among heterogeneous hardware and introduce a hierarchical search algorithm to solve the optimization problem. In addition, a mixed-precision pipeline method is used to reduce the cost of inter-cluster communication. We evaluate HGTrainer on heterogeneous connected clusters with popular large language models. The experimental results show that HGTrainer improves training throughput by 1.49x on average for the mixed heterogeneous cluster compared with the state-of-the-art Metis.
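The core of the partitioning problem can be shown on a toy pipeline: split consecutive model layers into one stage per cluster so that the slowest stage, given each cluster's compute speed, is as fast as possible. This brute-force split is only an illustrative stand-in for HGTrainer's hierarchical search, and the layer costs and cluster speeds are hypothetical.

```python
from itertools import combinations

def best_partition(layer_costs, cluster_speeds):
    """Split consecutive layers into len(cluster_speeds) pipeline stages
    so that the slowest stage (time = total layer cost / cluster speed)
    is minimized. Returns (stage boundaries, bottleneck stage time)."""
    n, k = len(layer_costs), len(cluster_speeds)
    best_bounds, best_time = None, float("inf")
    # Enumerate every way to cut the layer sequence into k stages.
    for cuts in combinations(range(1, n), k - 1):
        bounds = [0, *cuts, n]
        stage_time = max(
            sum(layer_costs[bounds[s]:bounds[s + 1]]) / cluster_speeds[s]
            for s in range(k))
        if stage_time < best_time:
            best_bounds, best_time = bounds, stage_time
    return best_bounds, best_time
```

Note how heterogeneity changes the answer: with equal speeds the optimum splits the layers evenly, while a faster cluster is assigned proportionally more layers so that no stage becomes the pipeline bottleneck. A hierarchical search prunes this space instead of enumerating it, which matters once device counts and parallelism dimensions grow.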
Abstract: Big data analytics, the process of organizing and analyzing data to obtain useful information, is one of the primary uses of cloud services today. Traditionally, collections of data are stored and processed in a single datacenter. As the volume of data grows at a tremendous rate, it becomes less efficient, from a performance point of view, for a single datacenter to handle such large volumes of data. Large cloud service providers are therefore deploying datacenters around the world for better performance and availability. A widely used approach to analytics of geo-distributed data is the centralized approach, which aggregates all the raw data from local datacenters to a central datacenter; however, this approach has been observed to consume a significant amount of bandwidth, degrading performance. A number of mechanisms have been proposed to achieve optimal performance when data analytics are performed over geo-distributed datacenters. In this paper, we present a survey of representative mechanisms proposed in the literature for wide-area analytics. We discuss basic ideas, present proposed architectures and mechanisms, and discuss several examples to illustrate existing work. We point out the limitations of these mechanisms, give comparisons, and conclude with our thoughts on future research directions.