Cloud Task Scheduling Based on Proximal Policy Optimization Algorithm for Lowering Energy Consumption of Data Center

  • Yang, Yongquan (Department of Computer Science and Technology, Ocean University of China) ;
  • He, Cuihua (Department of Computer Science and Technology, Ocean University of China) ;
  • Yin, Bo (Department of Computer Science and Technology, Ocean University of China) ;
  • Wei, Zhiqiang (Department of Computer Science and Technology, Ocean University of China) ;
  • Hong, Bowei (Department of Computer Science and Technology, Ocean University of China)
  • Received : 2022.01.05
  • Accepted : 2022.05.30
  • Published : 2022.06.30

Abstract

As a part of cloud computing technology, algorithms for cloud task scheduling have an important influence on cloud computing in data centers. In our earlier work, we proposed DeepEnergyJS, which was designed based on the original version of the policy gradient reinforcement learning algorithm, and we verified its effectiveness through simulation experiments. In this study, we use the Proximal Policy Optimization (PPO) algorithm to update DeepEnergyJS to DeepEnergyJSV2.0. First, we verify the convergence of the PPO algorithm on the Alibaba Cluster Data V2018 dataset. We then contrast it with the original reinforcement learning (REINFORCE) algorithm in terms of convergence rate, converged value, and stability. The results indicate that PPO performs better on the training and test data sets than the REINFORCE algorithm, as well as other general heuristic algorithms such as First Fit, Random, and Tetris. DeepEnergyJSV2.0 achieves about 7.814% better energy efficiency than DeepEnergyJS.

Keywords

1. Introduction

Cloud computing has become a trend in high-performance computing and is characterized by its large-scale, heterogeneous computing resources, and flexible computational architecture. In recent years, active work has appeared in cloud computing areas such as scheduling, placement, energy management, privacy and policy, security [1–4], and more. Algorithms for task scheduling have attracted extensive attention as a part of cloud computing technology. Task scheduling refers to mapping several tasks to computational resources. With the development of cloud service providers (CSPs), huge energy consumption and carbon dioxide emissions have become a serious challenge. Developing an energy-saving task scheduling strategy has practical importance.

Q-learning [5] is a classic algorithm for reinforcement learning. Deep neural networks have shown strong fitting ability in many fields, and reinforcement learning has an excellent ability for decision-making. Deep reinforcement learning (DRL) [6] combines deep learning (DL) [7] and reinforcement learning (RL) [8] algorithms, which enable it to solve complex control problems with a large state/action space. Deep learning captures the features of dynamic scenes from the current environment, and RL learns the best strategy guided by the corresponding reward obtained from interactions with the environment.

In our previous work, we proposed DeepEnergyJS [9], a cloud task scheduling framework based on deep reinforcement learning algorithms [10]. DeepEnergyJS obtained acceptable experimental results but can still be improved.

In this work, we upgrade our framework to DeepEnergyJSV2.0 by applying the Proximal Policy Optimization (PPO) algorithm in place of the original policy gradient algorithm in DeepEnergyJS, with the aim of reducing energy consumption more efficiently.

To validate DeepEnergyJSV2.0, we updated the data set from Alibaba Cluster Data V2017 [11] to Alibaba Cluster Data V2018 [12], which consists of hybrid-type tasks that contain both independent tasks and tasks with inner task dependencies. The original data set does not contain tasks with dependencies, which makes it less convincing for validating DeepEnergyJSV2.0. With Alibaba Cluster Data V2018, we can verify DeepEnergyJS on hybrid-type tasks, so our simulation experiment is more in line with real-world cases.

2. Related Works

Many approaches have been proposed to reduce the energy consumption of data centers through task scheduling. Devaraj et al. [13] proposed an algorithm based on the best-worst method (BWM) and the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), in which a modified particle swarm optimization algorithm achieves good results in energy-efficient load balancing. Peng et al. [14] proposed an optimal task workflow scheduling scheme based on the dynamic voltage and frequency scaling technique and the whale optimization algorithm, which can achieve a balance between performance and energy consumption. However, these offline algorithms have difficulty dealing with online dynamic tasks and large inputs, and the dynamics and complexity of an enterprise environment make scheduling even more challenging. Ding et al. [15] proposed a Q-learning-based task scheduling framework for energy-efficient cloud computing (QEEC) to minimize task response time and maximize each server's CPU utilization simultaneously. Seth et al. [16] discussed the dynamic heterogeneous shortest job first (DHSJF) model, which considers both the dynamic heterogeneity of workloads and of resources.

The task scheduling problem is NP-hard, and various meta-heuristic algorithms can provide feasible solutions under certain conditions. Current research on cloud computing task scheduling focuses mainly on independent tasks with traditional heuristic algorithms. Deep reinforcement learning (DRL) has attracted attention in recent years and has an outstanding ability to solve complicated control problems with high-dimensional state spaces and low-dimensional action spaces. However, research on how to apply DRL to obtain an efficient task scheduling strategy that makes full use of system resources is still lacking.

In the study of dynamic task scheduling, the widely used methods are either heuristic algorithms or DRL algorithms based on deep Q-networks. How to apply DRL algorithms to cloud computing task scheduling strategies still needs to be studied.

Compared with DQN, which adopts the ε-greedy strategy, the policy gradient can represent a stochastic policy and is free from tuning the ε parameter. Based on this, DeepEnergyJS was the first system to present a policy-gradient-based task scheduling method to minimize energy consumption and improve energy efficiency in a cloud computing system, and it is shown in this work to be effective for independent tasks as well as for tasks with dependencies. Our previous studies focused exclusively on independent tasks; in this paper, we prove that DeepEnergyJS is also applicable to hybrid-type tasks that contain both independent tasks and tasks with inner task dependencies, and we adopt another gradient-based algorithm called proximal policy optimization (PPO) [17] to update DeepEnergyJS to DeepEnergyJSV2.0 for better performance.

3. Theory

3.1 Proximal Policy Optimization Algorithm

The policy gradient method is a type of reinforcement learning algorithm. The original version of the policy gradient algorithm samples data based on the Monte Carlo method, so the variance of the estimated gradient is high. REINFORCE is an online learning gradient descent algorithm that minimizes a cost function (the negative expected return), which means that the agent that interacts with the environment and the agent that updates its model parameters using environmental feedback are the same. In the training phase, the agent has to sample a batch of trajectories under policy π and update the parameters of the same policy network; for the next iteration, the agent needs to interact with the environment again to collect new data, that is, it interacts with the environment while updating the parameters of the policy network. This online learning approach has several problems:

1. It consumes a significant amount of time for the agent to resample new data for iterative parameter updating.

2. Previously collected data cannot be reused, which leads to low data utilization.

The PPO algorithm is another policy gradient algorithm and is suitable for continuous control problems [18]; it is simpler in its mathematical implementation than other policy gradient method (PGM)-based RL algorithms [19]. PPO is an offline learning method: the policy it uses to interact with the environment and the policy to be learned differ. The main idea of the PPO algorithm is to transfer online learning to offline learning based on importance sampling and to adopt two networks to improve the convergence rate. One policy network π' handles environment interaction and data gathering, while the other network tweaks its parameters by observing the interactions between π' and the environment. Different networks correspond to different distribution functions, so an importance sampling mechanism is introduced: when it is difficult to sample from the original distribution p(x), one can instead sample from another distribution q(x) and multiply by a weight p(x)/q(x) to correct the difference between the two distributions. In the PPO algorithm, the data are sampled from π'. The derivation is shown in (1) to (3).

\(E_{x \sim p}[f(x)]=\int f(x) p(x) d x=\int f(x) \frac{p(x)}{q(x)} q(x) d x\)       (1)

\(\int f(x) \frac{p(x)}{q(x)} q(x) d x=E_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]\)       (2)

\(E_{x \sim p}[f(x)] \approx E_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]=\frac{1}{N} \sum_{i=1}^{N} f\left(x_{i}\right) \frac{p\left(x_{i}\right)}{q\left(x_{i}\right)}\)       (3)

With importance sampling, the adverse effect of estimating an expectation from samples drawn under a different distribution is corrected, with the weighting factor p(x)/q(x) acting as a regulator. For importance sampling to be effective, however, the new and old distributions cannot differ significantly. Therefore, in the actual application of the PPO algorithm, a constraint is added to limit the difference between the two distributions.
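To make the identity in (1) to (3) concrete, the following minimal NumPy sketch estimates an expectation under a distribution p using samples drawn from a different distribution q, weighting each sample by p(x)/q(x). The Gaussian distributions and the function f(x) = x² are illustrative choices only, not part of the paper.

```python
# Importance-sampling estimate of E_{x~p}[f(x)] using samples from q(x), as in (1)-(3).
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2                       # any function of interest (illustrative)
mu_p, mu_q, sigma = 0.0, 0.5, 1.0          # p = N(0, 1) is the "original" distribution

x_q = rng.normal(mu_q, sigma, size=100_000)                    # sample from q instead of p
weights = normal_pdf(x_q, mu_p, sigma) / normal_pdf(x_q, mu_q, sigma)   # p(x)/q(x)

estimate = np.mean(f(x_q) * weights)       # (1/N) * sum f(x_i) p(x_i)/q(x_i), as in (3)
direct = np.mean(f(rng.normal(mu_p, sigma, size=100_000)))     # sanity check: sample p directly

print(f"importance-sampling estimate: {estimate:.4f}, direct estimate: {direct:.4f}")
# Both approach E_{x~p}[x^2] = 1; the weights correct for sampling from q.
```

If q drifts far from p, the weights become extreme and the estimate degrades, which is exactly why PPO constrains the difference between the two policies.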

3.2 The Derivation Process of Policy Gradient in PPO Algorithm

Policy π' provides the collected data for updating the parameters of policy π, which is the main idea behind the PPO algorithm. The derivation of the gradient of the objective function \(\mathrm{J}^{\theta^{\prime}}(\theta)\) is given by (4).

\(\begin{aligned} \mathrm{J}^{\theta^{\prime}}(\theta) &=\mathrm{E}_{\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right) \sim \pi_{\theta}}\left[\mathrm{A}^{\theta}\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right) \nabla \log \pi_{\theta}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)\right] \\ &=\mathrm{E}_{\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{\pi_{\theta}\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right)}{\pi_{\theta^{\prime}}\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right)} \mathrm{A}^{\theta^{\prime}}\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right) \nabla \log \pi_{\theta}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)\right] \\ &=\mathrm{E}_{\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{\pi_{\theta}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)}{\pi_{\theta^{\prime}}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)} \frac{\pi_{\theta}\left(\mathrm{s}_{\mathrm{t}}\right)}{\pi_{\theta^{\prime}}\left(\mathrm{s}_{\mathrm{t}}\right)} \mathrm{A}^{\theta^{\prime}}\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right) \nabla \log \pi_{\theta}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)\right] \\ &\approx \mathrm{E}_{\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{\pi_{\theta}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)}{\pi_{\theta^{\prime}}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)} \mathrm{A}^{\theta^{\prime}}\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right) \nabla \log \pi_{\theta}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)\right] \end{aligned}\)       (4)

In formula (4), \(\frac{\pi_{\theta}\left(\mathrm{s}_{\mathrm{t}}\right)}{\pi_{\theta^{\prime}}\left(\mathrm{s}_{\mathrm{t}}\right)}\) is generally ignored; the probability ratio \(r_{\theta}=\frac{\pi_{\theta}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)}{\pi_{\theta^{\prime}}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)}\) measures the difference between the distributions π and π'. To narrow this difference, trust region policy optimization (TRPO) [19] suggests using an adaptive KL penalty coefficient, as expressed in (5).

\(\mathrm{J}_{\mathrm{ppo}}^{\theta^{\prime}}(\theta)=\mathrm{J}^{\theta^{\prime}}(\theta)-\beta \mathrm{KL}\left(\theta, \theta^{\prime}\right) \approx \sum_{\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right)} \mathrm{r}_{\theta} \mathrm{A}^{\theta^{\prime}}\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right)-\beta \mathrm{KL}\left(\theta, \theta^{\prime}\right)\)       (5)

The KL divergence quantifies the difference between the two distributions. If the two distributions are identical, the KL divergence is zero; a smaller value indicates a higher similarity between the two distributions, and a larger value indicates a greater discrepancy. The main (clipped) objective is represented by (6).

\(\mathrm{J}_{\mathrm{ppo2}}^{\theta^{\prime}}(\theta) \approx \sum_{\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right)} \min \left(\mathrm{r}_{\theta} \mathrm{A}^{\theta^{\prime}}\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right), \operatorname{clip}\left(\mathrm{r}_{\theta}, 1-\epsilon, 1+\epsilon\right) \mathrm{A}^{\theta^{\prime}}\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right)\right)\)       (6)

In formula (6), the function clip(rθ, 1 − ϵ, 1 + ϵ) is displayed in Fig. 1, and the function min(rθAθ′(st, at), clip(rθ, 1 − ϵ, 1 + ϵ)Aθ′(st, at)) is displayed in Fig. 2.

Fig. 1. Function clip(rθ, 1 − ϵ, 1 + ϵ)

Fig. 2. Function min (rθAθ′ (st, at), clip(rθ, 1 − ϵ, 1 + ϵ)Aθ′ (st, at))

The red line in the plots represents the objective function. When Aθ′(st, at) is greater than 0, the probability of the state-action pair (st, at) being selected will increase, which means the value of πθ(st, at) will also increase, but the ratio πθ(st, at)/πθ′(st, at) cannot exceed 1 + ϵ. Likewise, when Aθ′(st, at) is less than 0, the probability of the state-action pair (st, at) being selected will decrease, which reduces πθ(st, at), but the ratio πθ(st, at)/πθ′(st, at) cannot fall below 1 − ϵ. In this way, the distance between π and π′ is limited. Following the above analysis, the calculation of the policy gradient is shown in (7).

\(g_{b}(\tau)=\sum_{t=0}^{T} \min \left[\frac{\pi_{\theta}\left(a_{t} \mid s_{t}\right)}{\pi_{\theta^{\prime}}\left(a_{t} \mid s_{t}\right)} \nabla \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \cdot(R(\tau)-b), \operatorname{clip}\left(\frac{\pi_{\theta}\left(a_{t} \mid s_{t}\right)}{\pi_{\theta^{\prime}}\left(a_{t} \mid s_{t}\right)}, 1-\epsilon, 1+\epsilon\right) \cdot(R(\tau)-b)\right]\)      (7)
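The clipped surrogate objective in (6) and (7) can be computed directly from per-sample probability ratios and advantages. The NumPy sketch below is only an illustration: the log-probabilities and advantages are placeholder arrays, whereas in DeepEnergyJSV2.0 they would come from the current policy πθ, the sampling policy πθ′, and the advantage estimate Aθ′(st, at).

```python
# Clipped surrogate objective of (6): mean of min(r*A, clip(r, 1-eps, 1+eps)*A).
import numpy as np

def ppo_clipped_objective(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    ratio = np.exp(log_prob_new - log_prob_old)               # r_theta = pi_theta / pi_theta'
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)    # clip(r_theta, 1-eps, 1+eps)
    # Element-wise min keeps the pessimistic (lower) bound, as in (6).
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy usage with made-up numbers: the first sample's ratio (about 1.35) is clipped to 1.2.
log_prob_new = np.array([-0.7, -1.2, -0.4])
log_prob_old = np.array([-1.0, -1.1, -0.5])
advantages = np.array([2.0, -1.0, 0.5])
print(ppo_clipped_objective(log_prob_new, log_prob_old, advantages))
```

In training, this quantity is maximized (or its negative minimized) with respect to θ while θ′ stays fixed for the whole batch, which is what allows the collected data to be reused for several updates.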

4. Method

4.1 MDP Model

Task scheduling is an NP-hard problem, and the Markov decision process can provide a framework for modeling complex cloud task scheduling decision processes. As in our previous work, the state space is described by a list

S = [〈m1, t1〉, 〈m2, t2〉, …, 〈mi, ti〉, …, 〈mtotal, ttotal〉]

where each element in the list is an 〈mi, ti〉 pair, indicating that task instance ti can be scheduled to machine mi. The set of indexes of the state space list, A = [1, 2, …, i, …, total], denotes the action space; action i means selecting the 〈mi, ti〉 pair, that is, allocating ti to mi. The reward signal is the increase in the power of the data center caused by the current action, multiplied by (−1). The properties extracted from task ti and machine mi were presented in our previous work [9].
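As an illustration of how the state and action spaces of Section 4.1 could be materialized, the sketch below builds the list of feasible 〈machine, task-instance〉 pairs and uses the list indexes as actions. The Machine and TaskInstance classes and their fields are assumptions for this example, not the actual DeepEnergyJS data structures.

```python
# Build the state space S as a list of feasible <machine, task-instance> pairs;
# the action space is simply the set of list indexes.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Machine:
    mid: int
    free_cpu: float
    free_mem: float

@dataclass
class TaskInstance:
    tid: int
    cpu: float
    mem: float

def build_state_space(machines: List[Machine],
                      tasks: List[TaskInstance]) -> List[Tuple[Machine, TaskInstance]]:
    """Every feasible <machine, task-instance> pair becomes one state element."""
    return [(m, t) for t in tasks for m in machines
            if m.free_cpu >= t.cpu and m.free_mem >= t.mem]

machines = [Machine(0, 4.0, 8.0), Machine(1, 2.0, 4.0)]
tasks = [TaskInstance(0, 1.0, 2.0), TaskInstance(1, 3.0, 6.0)]

state_space = build_state_space(machines, tasks)
action_space = list(range(len(state_space)))   # action i selects the i-th <m, t> pair
print(len(state_space), action_space)          # task 1 fits only on machine 0
```

The reward would then be computed as the increase in total data center power caused by the chosen allocation, multiplied by −1, as described above.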

4.2 Formulation of Objective Optimization

The major research objective of this work is to propose a method for minimizing the energy consumption of a data center. The energy consumption of each machine is accumulated from its instantaneous power. The instantaneous power at time t is determined using (8).

\(\operatorname{power}(\mathrm{u})=\mathrm{P}_{\text {idle }}+\left(\mathrm{P}_{\text {busy }}-\mathrm{P}_{\text {idle }}\right) \cdot \mathrm{u}^{\mathrm{r}}\)       (8)

where Pidle represents the static (idle) power, Pbusy the maximum power, u the CPU utilization, and r a parameter determined by the machine type, which is obtained from the best-fit curves. Assuming that there are M machines, the total energy consumption is calculated using (9).

\(\mathrm{E}=\sum_{\mathrm{i}=1}^{\mathrm{M}} \sum_{\mathrm{t}=0}^{\mathrm{T}_{\mathrm{i}}} \operatorname{power}(\mathrm{u})_{\mathrm{t}}\)       (9)
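A minimal sketch of (8) and (9) is given below, assuming each machine exposes a per-time-step CPU-utilization trace; the (Pidle, Pbusy, r) values and the traces are placeholders in the style of the fitted parameters reported in Section 5.1.

```python
# Instantaneous power per (8) and total energy per (9), summed over all
# time steps of all machines, with utilization sampled once per time unit.
def power(u, p_idle, p_busy, r):
    """Instantaneous power of one machine at CPU utilization u in [0, 1], as in (8)."""
    return p_idle + (p_busy - p_idle) * (u ** r)

def total_energy(utilization_traces, machine_params):
    """Sum power over every time step of every machine, as in (9)."""
    return sum(
        power(u, *machine_params[i])
        for i, trace in enumerate(utilization_traces)
        for u in trace
    )

# Two hypothetical machines: (p_idle, p_busy, r) and short utilization traces.
machine_params = [(100.0, 300.0, 0.9569), (80.0, 250.0, 0.7257)]
traces = [[0.2, 0.5, 0.8], [0.1, 0.6]]
print(total_energy(traces, machine_params))
```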

4.3 The Overall Framework of the Method

The overall framework of the task scheduling method is shown in Fig. 3. The simulation experiments proceed in time steps; when tasks arrive, the state space is created. For every arriving taskInstancei, if machinej satisfies the resource demand of taskInstancei, the pair < machinej, taskInstancei > is incorporated into the state space list. It is worth noting that if machinek also meets the requirements, the pair < machinek, taskInstancei > must be added to the state space list as well. All of the machine-task instance pairs in the state space are input directly into the neural network, which outputs a fitness value for each pair. Suppose that < machineaction, taskInstanceaction > has the maximum fitness; taskInstanceaction is then scheduled to machineaction. After that, the state space is re-constructed for the remaining unscheduled task instances, and the process repeats until the observed state space is empty. It is worth noting that simulation time does not elapse while the state space is not empty. There is more than one way for the neural network to update its parameters: in our previous study, REINFORCE was adopted, whereas in this study we applied the PPO algorithm, first verifying its convergence and then contrasting the two algorithms in terms of convergence rate, converged value, and stability. The flow chart of the experiment is shown in Fig. 4.

Fig. 3. The Overall Framework of task scheduling method.

Algorithm 1. The procedures of the simulation.

Fig. 4. The flow chart of the experiment
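To summarize the loop described above and pictured in Fig. 3, the following simplified Python sketch (not the paper's actual Algorithm 1) repeatedly scores all feasible 〈machine, task-instance〉 pairs and schedules the highest-fitness pair until the state space is empty. The fitness_network function and the dictionary fields are placeholders; in DeepEnergyJSV2.0 the scores would come from the PPO policy network.

```python
# Illustrative scheduling loop: build the state space, score each pair,
# schedule the best pair, rebuild, and stop when the state space is empty.
import random

def fitness_network(pair):
    # Placeholder: the real system feeds pair features into the PPO policy network.
    return random.random()

def feasible_pairs(machines, tasks):
    return [(m, t) for t in tasks for m in machines
            if m["cpu"] >= t["cpu"] and m["mem"] >= t["mem"]]

def schedule_arrivals(machines, tasks):
    """Repeatedly pick the highest-fitness pair until no feasible pair remains.
    Simulation time does not advance inside this loop."""
    unscheduled = list(tasks)
    while True:
        state_space = feasible_pairs(machines, unscheduled)
        if not state_space:                       # observed state space is empty
            break
        machine, task = max(state_space, key=fitness_network)
        machine["cpu"] -= task["cpu"]             # allocate the selected resources
        machine["mem"] -= task["mem"]
        unscheduled.remove(task)
    return unscheduled                            # instances that could not be placed yet

machines = [{"cpu": 4.0, "mem": 8.0}, {"cpu": 2.0, "mem": 4.0}]
tasks = [{"cpu": 1.0, "mem": 2.0}, {"cpu": 3.0, "mem": 6.0}, {"cpu": 5.0, "mem": 1.0}]
print(schedule_arrivals(machines, tasks))         # the 5-CPU instance stays unscheduled
```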

5. Experiment

5.1 Experimental Environment

Deep learning approaches require large amounts of memory because of their computational intensity. In this study, we performed the simulation experiments on a computer with 32 GB of memory running Ubuntu (x86_64) with an Intel Xeon E5-2667 (3.20 GHz) processor with 16 cores. The method was simulated using Python and Python libraries such as Matplotlib [20], SimPy 3 [21], pandas [22], NumPy [23], and TensorFlow [24]. Programs were developed in JetBrains PyCharm 2020. To compute the energy consumption during the experiments based on (9), we downloaded data for several mainstream servers, which give the correspondence between CPU utilization and measured power, from the website (http://www.spec.org/power_ssj2008/results/power_ssj2008.html), as shown in Table 1, to fit the EM power model; the fitted curves are depicted in Fig. 5. The fitted EM power model parameter r for each kind of server is 0.9569, 0.7257, 1.5767, 0.7119, and 1.5324, respectively.

Table 1. Power of the selected servers at different CPU utilization levels

Fig. 5. Fitted curve for the EM power model
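The exponent r in (8) can be fitted from (CPU utilization, power) measurements such as those in Table 1. The sketch below uses a simple log-space least-squares fit with NumPy; the data points, Pidle, and Pbusy are made-up values for illustration, not the SPEC measurements used in the paper.

```python
# Fit the exponent r of the EM power model (8) from utilization/power samples.
import numpy as np

# Hypothetical measurements: utilization levels and measured power (watts).
u = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0])
p = np.array([120.0, 138.0, 170.0, 200.0, 228.0, 255.0])

p_idle, p_busy = 100.0, 255.0           # idle and full-load power of this server

# Rearranging (8): (power - P_idle) / (P_busy - P_idle) = u^r,
# so log of the left side equals r * log(u); fit r by least squares through the origin.
y = np.log((p - p_idle) / (p_busy - p_idle))
x = np.log(u)
r = np.sum(x * y) / np.sum(x * x)
print(f"fitted r = {r:.4f}")
```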

The experiments mainly focus on the task scheduling problem in heterogeneous cloud environments, and the number of CPU cores and memory units are listed in Table 2.

Table 2. Machine configurations

5.2 Dataset

In this study, the data set Alibaba Cluster Data V2018 [12] was used as the benchmark dataset. Compared to Alibaba Cluster Data V2017 [11], V2018 contains both independent tasks and tasks with dependencies. In practice, we divide the jobs in V2018 into several chunks, where each chunk has 10 jobs arriving in sequential order, and we trained DeepEnergyJSV2.0 on the first six chunks. The number of job chunks seen by DeepEnergyJSV2.0 and the number of training iterations accumulate as training proceeds. It is important to note that the number of tasks in each job chunk varies and that each task contains a different number of task instances, which means that the workload varies over time.
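As a rough illustration of this chunking scheme, the pandas sketch below orders jobs by arrival and groups them into chunks of 10. The file name and column names (job_name, start_time) are assumptions modeled on the Alibaba v2018 batch_task table and would need to be adapted to the actual trace files.

```python
# Split the jobs of the trace into chunks of 10 jobs in arrival order.
import pandas as pd

def split_into_job_chunks(batch_task_csv, chunk_size=10):
    df = pd.read_csv(batch_task_csv)
    # Order jobs by their earliest task start time, i.e., arrival order.
    job_order = (df.groupby("job_name")["start_time"].min()
                   .sort_values().index.tolist())
    chunks = [job_order[i:i + chunk_size]
              for i in range(0, len(job_order), chunk_size)]
    # Each chunk is the sub-frame containing all tasks of its 10 jobs.
    return [df[df["job_name"].isin(chunk)] for chunk in chunks]

# Example usage (assuming a local copy of the trace file):
# chunks = split_into_job_chunks("batch_task.csv")
# train_chunks, test_chunks = chunks[:6], chunks[6:10]
```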

5.3 Experimental Results and Analysis

5.3.1 Convergence and Generalization of PPO Algorithm

In this section, we compare the PPO algorithm with the REINFORCE algorithm under the same training procedure (300 iterations). The training curves are shown in Fig. 6.

Fig. 6. Training curves of PPO algorithm and REINFORCE algorithm

The figure shows that the experimental results confirm the convergence of PPO. In terms of stability, the curve of PPO is smoother than that of REINFORCE, which indicates better performance. In terms of convergence speed, except for Job Chunk No.4, PPO needs fewer iterations than REINFORCE to converge. For the converged value, the two algorithms reach similar values on Job Chunks No.1, No.2, and No.4, whereas on Job Chunks No.3, No.5, and No.6, PPO reaches lower energy consumption. Over the whole training process, PPO has an advantage over REINFORCE.

5.3.2 PPO and REINFORCE Comparison

In this section, we compare how well PPO and REINFORCE perform on the test job chunks. Table 3 lists the energy consumption for Job Chunks No.7 to No.10, and the corresponding line diagram is presented in Fig. 7.

Table 3. Energy consumption on test set

Fig. 7. Energy consumption on test set

As shown in Fig. 7, the PPO algorithm performs better than the other algorithms, that is, REINFORCE, First Fit, Random, and Tetris, with REINFORCE performing second best. This also emphasizes that deep reinforcement learning (DRL) algorithms have significant advantages over general heuristic algorithms in solving difficult cloud task scheduling problems.

Table 4. Percentage reduction in energy consumption achieved by PPO compared to the other algorithms

Table 4 shows the percentage reduction in energy consumption achieved by PPO compared with the other algorithms. PPO reached a maximal reduction of approximately 7.814% versus REINFORCE, 8.932% versus First Fit, 13.864% versus Random, and 8.698% versus Tetris. Across all of the job chunks, the PPO algorithm was verified to be the most effective among the scheduling algorithms used in this study, and the generalization of the method was also validated.

6. Conclusion and Future Work

With the goal of minimizing energy consumption, the work in this paper is an extension of our previous work. We changed the DRL algorithm in the task scheduling method from REINFORCE to PPO, thereby updating the previous DeepEnergyJS to DeepEnergyJSV2.0. The experimental results show that DeepEnergyJSV2.0 achieves excellent results on hybrid-type tasks that contain both independent tasks and tasks with inner task dependencies when optimizing the energy consumption objective. In addition, DeepEnergyJSV2.0 achieves better overall performance on the training set and can find a better near-global optimal solution on the test set than common heuristic algorithms such as First Fit, Random, and Tetris.

If applied in an actual cloud environment, our proposed DeepEnergyJSV2.0 will be an alternative to many existing task scheduling methods based on heuristic algorithms. However, the shortcomings of our research and directions for future studies are as follows:

1. Real cloud computing environments are more complex, but our research was conducted only for a single data center.

2. The research in this study is aimed only at optimizing a single objective. In practice, a comprehensive consideration of multi-objective optimization would be more rewarding.

3. Containerization technology is an emerging virtualization technique that will play an increasingly important role in the future. Developing efficient strategies for scheduling tasks in containers is a focus for future research.

4. The performance of PPO is not stable enough. As shown in Fig. 6, the results of PPO are not better than those of REINFORCE on Job Chunks No.1, No.2, and No.4. The reason is that the task types in different job chunks differ. This result indicates that the performance of the PPO algorithm is not superior to the REINFORCE algorithm in some specific scenarios; these scenarios need to be studied in the future to determine their underlying patterns.

5. In the future, cloud computing will need to be combined with other computing models such as edge computing, fog computing, serverless computing, and quantum computing [25]. How to integrate these computing models to complete computing tasks and how to use AI/ML to optimize them are main research directions for the future, which also pose a huge challenge for us.

Acknowledgement

This work is financially supported by the Shandong Key R&D Program (Major Science and Technology Innovation Project) (2020CXGC010704), the Key R&D Projects of Shandong Province (2020JMRH0201), the Key Projects of New and Old Kinetic Energy Conversion 2020, the Qingdao Independent Innovation Major Project (20-3-2-12-xx), and the Project of Introducing Urgently Needed Talents in Key Supported Regions of Shandong Province.

References

  1. Y. Yin, Y. Xu, W. Xu, M. Gao, L. Yu, and Y. Pei, "Collaborative Service Selection via Ensemble Learning in Mixed Mobile Network Environments," Entropy, vol. 19, no. 7, Jul. 2017.
  2. J. Yu, Z. Kuang, B. Zhang, W. Zhang, D. Lin, and J. Fan, "Leveraging Content Sensitiveness and User Trustworthiness to Recommend Fine-Grained Privacy Settings for Social Image Sharing," IEEE Transactions on Information Forensics and Security, vol. 13, no. 5, pp. 1317-1332, May 2018. https://doi.org/10.1109/tifs.2017.2787986
  3. J. Yu, B. Zhang, Z. Kuang, D. Lin, and J. Fan, "iPrivacy: Image Privacy Protection by Identifying Sensitive Objects via Deep Multi-Task Learning," IEEE Transactions on Information Forensics and Security, vol. 12, no. 5, pp. 1005-1016, May 2017. https://doi.org/10.1109/TIFS.2016.2636090
  4. L. Y. Zuo and Z. B. Cao, "Review of scheduling research in cloud computing," Application Research of Computers, vol. 29, no. 11, pp. 4023-4027, 2012. https://doi.org/10.3969/j.issn.1001-3695.2012.11.005
  5. C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College, Cambridge United Kingdom, 1989.
  6. V. Mnih et al., "Playing Atari with Deep Reinforcement Learning," arXiv:1312.5602 [cs], Dec. 2013.
  7. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015. https://doi.org/10.1038/nature14539
  8. S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995.
  9. C. He, Y. Yang, and B. Hong, "Cloud Task Scheduling Based on Policy Gradient Algorithm in Heterogeneous Cloud Data Center for Energy Consumption Optimization," in Proc. of 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), pp. 1-5, Nov. 2020.
  10. R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach Learn, vol. 8, no. 3, pp. 229-256, May 1992. https://doi.org/10.1007/BF00992696
  11. C. Lu, K. Ye, G. Xu, C.-Z. Xu, and T. Bai, "Imbalance in the cloud: An analysis on Alibaba cluster trace," in Proc. of 2017 IEEE International Conference on Big Data (Big Data), pp. 2884-2892, Dec. 2017.
  12. "Alibaba Cluster Trace Program," Alibaba, 2021. Accessed: Jan. 04, 2022. [Online]. Available: https://github.com/alibaba/clusterdata/blob/4221e02342dd01fd30a9800b19b7f365a3fd5ac8/cluster-trace-v2018/trace_2018.md
  13. A. F. S. Devaraj, M. Elhoseny, S. Dhanasekaran, E. L. Lydia, and K. Shankar, "Hybridization of firefly and Improved Multi-Objective Particle Swarm Optimization algorithm for energy efficient load balancing in Cloud Computing environments," Journal of Parallel and Distributed Computing, vol. 142, pp. 36-45, Aug. 2020. https://doi.org/10.1016/j.jpdc.2020.03.022
  14. H. Peng, W.-S. Wen, M.-L. Tseng, and L.-L. Li, "Joint optimization method for task scheduling time and energy consumption in mobile cloud computing environment," Applied Soft Computing, vol. 80, pp. 534-545, Jul. 2019. https://doi.org/10.1016/j.asoc.2019.04.027
  15. D. Ding, X. Fan, Y. Zhao, K. Kang, Q. Yin, and J. Zeng, "Q-learning based dynamic task scheduling for energy-efficient cloud computing," Future Generation Computer Systems, vol. 108, pp. 361-371, Jul. 2020. https://doi.org/10.1016/j.future.2020.02.018
  16. S. Seth and N. Singh, "Dynamic heterogeneous shortest job first (DHSJF): a task scheduling approach for heterogeneous cloud computing systems," Int. j. inf. tecnol., vol. 11, no. 4, pp. 653-657, Dec. 2019. https://doi.org/10.1007/s41870-018-0156-6
  17. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal Policy Optimization Algorithms," arXiv:1707.06347 [cs], Aug. 2017.
  18. V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015. https://doi.org/10.1038/nature14236
  19. J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust Region Policy Optimization," in Proc. of the 32nd International Conference on Machine Learning, pp. 1889-1897, Jun. 2015.
  20. J. D. Hunter, "Matplotlib: A 2D Graphics Environment," Computing in Science & Engineering, vol. 9, no. 03, pp. 90-95, May 2007. https://doi.org/10.1109/MCSE.2007.55
  21. "SimPy," Team SimPy, 2020. [Online]. Available: https://simpy.readthedocs.io/en/latest/index.html
  22. W. McKinney, "Data Structures for Statistical Computing in Python," in Proc. of the 9th Python in Science Conference, pp. 56-61, 2010.
  23. S. van der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy Array: A Structure for Efficient Numerical Computation," Computing in Science Engineering, vol. 13, no. 2, pp. 22-30, Mar. 2011.
  24. M. Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems," 2015. [Online]. Available: https://www.tensorflow.org/
  25. S. S. Gill et al., "AI for Next Generation Computing: Emerging Trends and Future Directions," Internet of Things, 2022.