EHMM-CT: An Online Method for Failure Prediction in Cloud Computing Systems

  • Zheng, Weiwei (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications) ;
  • Wang, Zhili (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications) ;
  • Huang, Haoqiu (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications) ;
  • Meng, Luoming (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications) ;
  • Qiu, Xuesong (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications)
  • Received : 2015.11.10
  • Accepted : 2016.07.29
  • Published : 2016.09.30

Abstract

The current cloud computing paradigm is still vulnerable to a significant number of system failures. The increasing demand for fault tolerance and resilience in a cost-effective and device-independent manner is a primary motivation for an effective means of addressing system dependability and availability concerns. This paper focuses on online failure prediction for cloud computing systems using system runtime data, which differs from traditional fault tolerance techniques that require in-depth knowledge of the underlying mechanisms. A failure prediction approach based on Cloud Theory (CT) and the Hidden Markov Model (HMM) is proposed that extends the HMM by training it with CT. In the approach, a parameter ω is defined to capture the correlations between the various runtime indices of a cloud computing system and failures, so that multiple runtime indices can be taken into account. Furthermore, the approach describes failure prediction along multiple dimensions by extending the parameters of the HMM. The likelihood and membership degree computing algorithms of CT are used, instead of the traditional HMM algorithms, to reduce the computing overhead of the model training phase. Finally, simulation results show that the proposed approach provides accurate predictions at low computational cost, obtaining an optimal tradeoff between failure prediction performance and computing overhead.

1. Introduction

A large number of data centers are built in modern cloud computing systems. In general, these data centers are constructed from commodity servers whose components are developed by different manufacturers. Due to the highly complex nature of the underlying infrastructure connecting the components, these systems incur a high risk of encountering failures and exceptions [1][3][12]. As the mainstay for efficiently meeting the demand for service providers’ cloud-based services, data centers raise the challenges of high availability and scalability. Such data centers usually carry a number of high-performance computing (HPC) applications, which distribute and replicate data on several nodes in order to meet stringent Quality of Service (QoS) requirements regarding high system availability. As such, failures are inevitable and may lead to catastrophic consequences for the whole system. Specifically, failures in system nodes can abort applications, which usually span various nodes, resulting in little forward progress [2]. When the dynamics of the systems are also examined, susceptibility to cascading failures is revealed, which in particularly serious cases causes the entire system to be affected.

Failures must be handled promptly, which plays a crucial role in ensuring system survivability and reliability in cloud computing systems. An appropriate way to avoid failures is to predict them by sensing the occurrence of anomalous behavior. Accurate and timely predictions can also mitigate the effect of failures by enabling proper recovery actions to be taken before failures occur [12]. Liang et al. [2] have shown that the capability of predicting the time/location of the next failure, though not perfect, can considerably boost the benefits of other runtime fault tolerance techniques.

As an innovative approach to further enhance system dependability and availability, failure prediction anticipates failures before they occur, and performs preventive strategies or reduces time-to-repair by preparing for imminent failures [8]. Failure prediction does not intend to identify the root cause of problems; rather, its goal is to evaluate the current system state and to estimate the possibility of a failure occurring in the near future.

Current techniques for failure prediction and anomaly detection mainly focus on pattern matching and statistical analysis schemes, where failures or anomalies are considered deviations from normal behavior and are modeled in terms of system variability [9][8]. Specifically, online machine learning and data mining are exploited to model the behavior of systems, using error sequences or symptom-specific features such as CPU and memory utilization. As one of the classic pattern matching models, the Hidden Markov Model (HMM) has been successfully applied to various pattern recognition tasks, for example, speech recognition and genetic sequence analysis, as well as to dependable computing, including intrusion detection, fault diagnosis, and network traffic modeling [10].

Although HMMs have been extended for different scenarios and requirements, these extensions are not appropriate for the symptom monitoring-based failure prediction approach [10]. Additionally, HMMs and their extensions usually apply machine learning techniques to model training and pattern identification. The computational complexity of the training is high because each step of the long iterative process re-estimates all the parameters numerically. To address this problem, this paper exploits Cloud Theory (CT), which has been proven useful for resolving the uncertain transitions between quantitative values and qualitative terms when training prediction models [11][17].

This paper focuses on online failure prediction that evaluates the current state of the system and makes a short-term failure prediction, supported by runtime symptom monitoring-based methods. In this work, the time-varying characteristics of the system are referred to as indices and form the basis of the failure prediction. Specifically, this paper proposes a failure prediction approach based on an extended HMM and CT (EHMM-CT). This method extends the HMM and trains the model with CT. By analyzing the runtime indices that represent the states of the system, the proposed EHMM-CT models the behavior of the system. The main contributions of this paper are summarized as follows.

  • A parameter ω is defined to capture the correlations between the various runtime indices and failures, so that multiple runtime indices of a cloud computing system can be taken into account.
  • The parameters of the HMM are extended to multiple dimensions to describe failure prediction in detail.
  • The likelihood and membership degree computing algorithms of CT are used, instead of the traditional HMM training algorithms, to reduce the computing overhead of the model training phase.

The rest of this paper is organized as follows. Section 2 gives a short review of HMM and CT. The proposed EHMM-CT is then presented in detail in Section 3. Section 4 provides the online failure prediction model design. Section 5 describes the evaluation environment and presents the performance of the proposed EHMM-CT. Section 6 presents the related work. Finally, conclusions and future work are highlighted in Section 7.

 

2. Preliminaries

In this section, a brief overview of HMMs and CT is presented. In particular, for CT, the calculation process of the likelihood, which is applied directly in the EHMM-CT, is shown in detail.

2.1 Review of Hidden Markov Models

HMMs are an extension of the Discrete Time Markov Chain (DTMC), which consists of three elements: i) a state set S = {si}(i=1,...,N) containing N states; ii) a square matrix A = {aij}(i,j=1,...,N) defining the transition probabilities between the states; and iii) a vector π = {πi}(i=1,...,N) specifying the initial state probabilities. Additionally, HMMs are determined by two other quantities, the symbol set O and the emission probability distribution B. Specifically, O = {oi}(i=1,...,P) is a finite, countable set containing P different symbols. B = {bij}(i=1,...,N;j=1,...,P) is a stochastic matrix, where bij is the probability of emitting symbol oj given that the stochastic process is in state si. For better readability, bij may sometimes be denoted by bi(oj). In HMMs, only the outputs can be measured from outside, and the state of the stochastic DTMC process is hidden from the observer. Therefore, an HMM is usually described by a tuple of five elements HMM = {S, O, A, B, π}, simply expressed by λ = {A, B, π} [10].

Standard HMM algorithms such as Baum-Welch and Forward-Backward [10] are adopted for model training (i.e., adjusting λ from a set of training sequences) as well as for efficiently computing sequence likelihoods. Nonetheless, these algorithms are based on machine learning and estimate the parameters by repeated recursions and iterations, which may incur extensive computation and time overhead.
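As a minimal illustration of the notation above, the following Python sketch defines a toy λ = {A, B, π} (all numbers invented for illustration) and computes a sequence likelihood with the standard forward algorithm:

```python
import numpy as np

# Toy lambda = {A, B, pi} with N = 2 states and P = 2 symbols; all numbers
# are invented for illustration only.
A = np.array([[0.9, 0.1],
              [0.3, 0.7]])          # a_ij: transition probabilities
B = np.array([[0.8, 0.2],
              [0.1, 0.9]])          # b_i(o_j): emission probabilities
pi = np.array([0.5, 0.5])           # initial state probabilities

def sequence_likelihood(obs):
    """P(O | lambda) via the standard forward algorithm."""
    alpha = pi * B[:, obs[0]]           # initialization over hidden states
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # induction: sum over predecessors
    return float(alpha.sum())           # termination

print(sequence_likelihood([0, 0, 1, 1]))
```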

2.2 Cloud Theory Overview

CT is a model describing the transition between quantitative data and qualitative terms. It has been successfully applied in many areas [11][4][16][17][13], including, for example, natural language processing, data mining, decision analysis, intelligent control, and image processing. The details of CT are presented below.

Cloud Model [11]. Let U be the universe of discourse, and C be the qualitative term associated with U. The membership degree of any element x in U to the qualitative term C, denoted by membership(x, C), is a random number with a stable tendency taking values in [0, 1]. The distribution of x in U is then defined as a cloud model, and each x is called a cloud drop.

Normal Cloud Model [11]. If x in U satisfies: i) the distribution of x in U is a cloud model; ii) x ~ N(Ex, En′²), where En′ ~ N(En, He²); and iii) the membership degree of x to the qualitative term C is membership(x, C) = exp(−(x − Ex)² / (2En′²)), then the distribution of x in U is defined as a normal cloud.

Feature Vector of Cloud. A normal cloud can be expressed by a tuple of independent parameters C = (Ex, En, He), called the feature vector. In the feature vector, the expected value Ex represents the overall level of the cloud model, the entropy En represents the degree of dispersion, and the hyper-entropy He denotes the uncertainty of En.

Cloud Generator [4][13]. Cloud generators are models realizing the transition between quantitative values and qualitative concepts, and consist of the forward cloud generator (CG) and the backward cloud generator (CG-1).

Given a cloud model C = (Ex, En, He), the CG is used to generate several cloud drops drop(xi, membership(xi, C)) based on the feature vector, while CG-1 extracts the feature vector of a cloud model from cloud drops.
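To make the two generators concrete, here is a short Python sketch. The forward generator follows the normal cloud definition above; the backward generator is the common moment-based CG-1 variant, which is an assumption here since the paper does not state which CG-1 algorithm it uses:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_cloud_generator(Ex, En, He, n):
    """CG: generate n drops (x_i, membership(x_i, C)) from C = (Ex, En, He)."""
    En_prime = rng.normal(En, He, n)            # En' ~ N(En, He^2)
    x = rng.normal(Ex, np.abs(En_prime))        # x ~ N(Ex, En'^2)
    membership = np.exp(-(x - Ex) ** 2 / (2 * En_prime ** 2))
    return x, membership

def backward_cloud_generator(x):
    """CG^-1: estimate the feature vector (Ex, En, He) from drops x."""
    Ex = x.mean()
    En = np.sqrt(np.pi / 2) * np.abs(x - Ex).mean()   # first-order moment
    He = np.sqrt(max(x.var(ddof=1) - En ** 2, 0.0))   # leftover variance
    return Ex, En, He

drops, _ = forward_cloud_generator(Ex=0.6, En=0.1, He=0.01, n=5000)
print(backward_cloud_generator(drops))   # approximately (0.6, 0.1, 0.01)
```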

Likelihood. The likelihood likelihood(C1, C2) between two given cloud models C1 = (Ex1, En1, He1) and C2 = (Ex2, En2, He2) is defined from the total distance between the drops of the two clouds, and its range is [0, 1]. The details are shown in Algorithm 1.

Algorithm 1. Calculation process of the likelihood.
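Since the steps of Algorithm 1 are not reproduced in the text, the following is only a plausible sketch: it matches sorted drops from the two clouds, sums their distances, and maps the result into (0, 1] with exp(−distance), an assumed mapping that returns values near 1 for identical clouds:

```python
import numpy as np

rng = np.random.default_rng(1)

def cloud_likelihood(c1, c2, n=2000):
    """Plausible sketch of Algorithm 1: sum distances between matched
    (sorted) drops of the two clouds, then map into (0, 1]."""
    def drops(c):
        Ex, En, He = c
        En_prime = rng.normal(En, He, n)
        return np.sort(rng.normal(Ex, np.abs(En_prime)))
    mean_distance = np.abs(drops(c1) - drops(c2)).mean()
    return float(np.exp(-mean_distance))   # 1 for identical clouds

print(cloud_likelihood((0.6, 0.10, 0.01), (0.6, 0.10, 0.01)))  # near 1
print(cloud_likelihood((0.2, 0.05, 0.01), (0.8, 0.05, 0.01)))  # clearly smaller
```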

 

3. EHMM-CT: An Extension of Hidden Markov Model

The proposed approach builds on the fundamental assumption that characteristic patterns of symptoms identify the failure-prone behavior. The goal for failure prediction is to evaluate the current system state by taking into account the risk that a failure occurs within a short interval in the future, regardless of the fault causing the failure.

EHMM-CT is a novel extension of HMMs, which is appropriate for failure prediction in cloud computing systems. Approaches in CT are introduced to model training, which can significantly reduce the amount and complexity of computation. EHMM-CT aims to model system behavior using historical runtime data. It estimates the probability of the failure occurrence using system state matching. The details of EHMM-CT are shown as follows.

3.1 What Is New about EHMM-CT

The proposed EHMM-CT is designed for failure prediction in cloud computing systems. EHMM-CT extends the HMMs presented in Section 2.1 in three aspects: i) the state set and the associated probability distributions are extended from vectors and matrices to matrices and cubes, so that multiple runtime indices can be modeled simultaneously; ii) a weight parameter ω is introduced to represent the correlations between the individual indices and the occurrence of failures; and iii) the likelihood and membership degree algorithms of CT replace the traditional iterative training algorithms, reducing the computing overhead of model training.

3.2 Parameters

The first extension builds on the fact that data collection for representing the states of systems may involve various indices, such as CPU utilization, memory utilization, and I/O requests. To characterize the different states of these indices, this paper extends the state set from a vector to a matrix S = {sij}sn×N, where N is the number of indices and sn is the maximum size of an index state set; thus sij denotes the i-th state of the j-th index. For better readability, St is defined as the system state at the t-th slot.

Also, a cube A = {Ak}N is defined to hold the transition probabilities between states, where Ak = {aij}snk×snk is a stochastic matrix satisfying Σj aij = 1 for each row i, and snk is the size of the state set of index k.

An sn×N matrix of initial state probabilities π = {πij} also has to be specified. For each column j of π, Σi πij = 1 must be satisfied.

A description of the stochastic process of EHMM-CT follows. An initial state for each index k is chosen according to the probability distribution defined by π. Starting from the initial state, the process transits from one state to the next according to the transition probabilities defined by A. Therefore, EHMM-CT satisfies the so-called Markov assumption, which can be expressed by the following equation:

P(St+1 = sjk | St = sik, St−1, ..., S1) = P(St+1 = sjk | St = sik) = aij,

where i, j = 1, ..., snk; k = 1, ..., N.

The Markov assumption means that, since the transition probability depends only on the immediately preceding state, the process has no memory of the states it has travelled through. It is also assumed that, for a given index, each state can reach any other state in one step.

In this work, the set of symbols O = {oi} is instantiated. Specifically, in cloud computing systems, the collected indices can be divided into three degrees: normal, alarm, and failure. Normal represents that the runtime indices are in a normal condition; alarm denotes an abnormality of system resources and is relevant to a set of symptoms. Alarms can be further classified by severity as critical alarm, major alarm, minor alarm, and warning. An alarm generally implies that part of the system indices are abnormal but may not affect the running of the system [7]. Failure is the observation made when the system is malfunctioning. Hence, the observation set of the EHMM-CT is defined as O = {N, Awarning, Aminor, Amajor, Acritical, F}. Also, Ot is defined as the observation at the t-th slot.

Moreover, a stochastic cube B = {Bk}N is defined. For each element Bk = {bij}snk×P, row i represents a probability distribution for state sik. Specifically, bij is the probability of emitting symbol oj given that the stochastic process is in state sik, i.e., bij = P(Ot = oj | St = sik). Hence, Bk has dimensions snk×P and satisfies Σj bij = 1 for each row i, where P is the size of the symbol set O.

For the requirements of failure prediction in cloud computing systems, ω = {ω1, ..., ωN} is defined to represent the correlations between the multiple indices and the occurrence of failures, where the i-th element of the vector is the weight probability of index i causing a failure. It is constrained that

ωi = P(F | S*i), i = 1, ..., N.    (2)

In Eq. (2), S*i represents the value space of the state of index i and corresponds to the i-th column of S, i.e., S*i ∈ {s1i, ..., ssnii}. Thus, ωi is the weight probability of index i causing a failure.

In summary, the EHMM-CT is exactly defined by EHMM-CT = {S, A, B, π, ω}, simply expressed by λ = {A, B, π, ω}.
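For illustration, the parameter set can be held in a container such as the following Python sketch; the per-index layout (one transition matrix, emission matrix, and initial distribution per index k) follows the definitions above, while the concrete class structure is our own assumption:

```python
import numpy as np
from dataclasses import dataclass

# O = {N, A_warning, A_minor, A_major, A_critical, F}, so P = 6.
SYMBOLS = ["N", "A_warning", "A_minor", "A_major", "A_critical", "F"]

@dataclass
class EHMMCT:
    S: list              # S[k]: state clouds (Ex, En, He) of index k
    A: list              # A[k]: sn_k x sn_k stochastic transition matrix
    B: list              # B[k]: sn_k x P stochastic emission matrix
    pi: list             # pi[k]: initial state distribution of index k
    omega: np.ndarray    # index weights, summing to 1

    def validate(self):
        """Check the stochasticity constraints from Section 3.2."""
        for k in range(len(self.S)):
            assert np.allclose(np.asarray(self.A[k]).sum(axis=1), 1.0)
            assert np.allclose(np.asarray(self.B[k]).sum(axis=1), 1.0)
            assert np.isclose(np.asarray(self.pi[k]).sum(), 1.0)
        assert np.isclose(self.omega.sum(), 1.0)
```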

 

4. Failure Prediction Model Design

In this section, we present how the model EHMM-CT = {S, A, B, π, ω} is trained and used for failure prediction. The EHMM-CT is called “hidden” because it is assumed that only the generated symbols can be observed and that the state of the stochastic process is hidden from the observer. In the case of failure prediction, the “hidden” part maps to the fundamental concepts of faults and symptoms.

The proposed failure prediction approach consists of two phases: model training and failure prediction. The former mainly addresses how to adjust the model parameters so as to model the behavior of systems, whereas the latter focuses on the online failure prediction. Fig. 1 shows the detail.

Fig. 1. Online failure prediction based on EHMM-CT.

4.1 Model Training Phase

In the model training phase, we focus on estimating the model parameters that reflect the features of the runtime indices and the behavior of the system. Instead of the standard HMM training algorithms (Forward-Backward and Baum-Welch, which acquire the parameters through repeated iterations and recursions [10]), approaches from CT are used to estimate the parameters. With CT, the feature vector associated with the statistical characteristics of the training samples can be extracted. The parameter set of EHMM-CT is then estimated via the maximum membership degree and the maximum likelihood in CT, and represents the “characteristic vector” of the training sample distribution.

4.1.1 State Division

In the state division, the model aims to instantiate the state set S. In particular, each element in S is defined to correspond to a cloud model, called a state cloud. First, the number and boundaries of the divided state clouds for each index are calculated by estimating the index values (steps (a)-(c)); then the feature vector of each state cloud is extracted (step (d)). The details are as follows.

A failure-prone state is then selected for each index j from the state sets generated above: it is the state whose feature vector deviates most from the normal one, and thus has the greatest possibility of triggering failures.

4.1.2 Transition Probability

The cloud likelihood represents the similarity between clouds and is quantified by a value between 0 and 1. It is obtained by adding up the distances of drops from the clouds (as shown in Algorithm 1). In this model, since the states of an index are represented by state clouds, the cloud likelihood can be exploited to estimate the state transition probabilities. For index k, the transition probability distribution Ak is calculated by

aij = likelihood(ci, cj) / Σm likelihood(ci, cm), m = 1, ..., snk,

where aij is the probability of state sik transiting to state sjk, and likelihood(ci, cj) is the likelihood of clouds ci and cj (see Section 2.2), whose range is [0, 1]. The row normalization makes Ak a stochastic matrix.
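A sketch of this step, under the row-normalized reading of the formula above, might look as follows; the likelihood argument is any cloud likelihood function, such as the Algorithm 1 sketch in Section 2.2:

```python
import numpy as np

def transition_matrix(state_clouds, likelihood):
    """A_k for one index: pairwise cloud likelihoods, row-normalized so
    that each row is a probability distribution (the normalization is
    our reading of the formula above)."""
    n = len(state_clouds)
    L = np.array([[likelihood(state_clouds[i], state_clouds[j])
                   for j in range(n)] for i in range(n)])
    return L / L.sum(axis=1, keepdims=True)
```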

4.1.3 Emission Probability

The emission probability indicates the possibility of observing a certain symbol in a specified state. In EHMM-CT, the membership degree of CT is exploited to describe the emission probability: the larger the membership degree, the better the fit between the observation and the state (i.e., the more likely the observation is emitted in the corresponding state), and vice versa.

Therefore, the membership degrees of the index data collected at different slots are calculated, and the observation probability is defined as the average of these membership degrees. The formal process is shown in Algorithm 2.

Algorithm 2. Estimating the emission probability distribution.
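Since Algorithm 2 is described in prose only, the following sketch shows one plausible reading for a single state cloud: average the membership degrees of the data observed under each symbol, then normalize the row (the normalization is our assumption to make the row stochastic):

```python
import numpy as np

def emission_row(samples_by_symbol, state_cloud):
    """One row of B_k (state s_ik): average membership degree of the index
    data observed under each symbol o_j, normalized over the P symbols."""
    Ex, En, _He = state_cloud
    def avg_membership(xs):
        xs = np.asarray(xs, dtype=float)
        if xs.size == 0:
            return 0.0
        return float(np.exp(-(xs - Ex) ** 2 / (2 * En ** 2)).mean())
    row = np.array([avg_membership(xs) for xs in samples_by_symbol])
    return row / row.sum() if row.sum() > 0 else row
```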

4.1.4 Weight Probability

The correlations between indices and failure occurrences are characterized in three aspects: DF (Dispersion Frame) [15], FDI (Failure Dispersion Index), and Rate. For each index j, a DF is the interval (number of slots) between successive alarms, symbolized by T=(Δτ1,Δτ2,...); the FDI is the number of failures N=(n1,n2,...) observed within a DF; and the Rate is the ratio of DFs in which a failure is encountered to all DFs. That is, Rate represents the correlation between the alarms of index j and failure occurrences and is expressed by γj. Fig. 2 shows how the DF, FDI, and Rate are obtained.

Therefore, the relationship among the index weight ω, DF, FDI, and Rate is presented as follows:

Fig. 2. Method to determine the index weights. For example, the values of DF, FDI, and Rate for index A are Ta=[4,6,9], Na=[1,0,1], and γa=2/3, respectively; for index B, Tb=[8,4,2,6], Nb=[1,0,0,1], and γb=2/4; for index C, Tc=[8,10], Nc=[1,1], and γc=1.

where T̄j and N̄j are the averages of the DF and the FDI for index j, respectively.

Finally, we normalize the weight probability according to formula (10), so that the weight probability ω = {ω1, ..., ωN} satisfies

Σi ωi = 1,

where ωi ∈ [0, 1].
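The DF/FDI/Rate computation can be written down directly from the definitions above and checked against the Fig. 2 example; the final normalization corresponds to formula (10). The alarm/failure slot positions below are hypothetical values chosen to reproduce the index A numbers of Fig. 2:

```python
import numpy as np

def df_fdi_rate(alarm_slots, failure_slots):
    """DF: slot gaps between successive alarms; FDI: failures per frame;
    Rate: fraction of frames containing at least one failure."""
    T, N = [], []
    for a, b in zip(alarm_slots, alarm_slots[1:]):
        T.append(b - a)
        N.append(sum(1 for f in failure_slots if a < f <= b))
    rate = sum(1 for n in N if n > 0) / len(N)
    return T, N, rate

T, N, rate = df_fdi_rate(alarm_slots=[0, 4, 10, 19], failure_slots=[2, 15])
print(T, N, rate)        # [4, 6, 9] [1, 0, 1] 0.666... (index A of Fig. 2)

def normalize(raw_weights):
    """Formula (10): scale the raw index weights so that they sum to 1."""
    w = np.asarray(raw_weights, dtype=float)
    return w / w.sum()
```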

4.2 Failure Prediction Phase

4.2.1 Preprocess of Runtime Indices

In this section, the current states of the indices, named index states, are evaluated and represented as clouds. In particular, the feature vector of the index state is extracted for each index from the collected runtime data matrix X = {xij}T×N. Note that, in this matrix, the row corresponds to the relative collection time of the runtime data, i.e., the larger the row index, the fresher the data.

To account for the timeliness of the runtime system indices, a constant, the time impact factor (TIF), is defined to differentiate the impact on failures of index data collected at different slots: recently collected data is more valuable for failure prediction than previously obtained data (see equation (12)).

For the collected runtime index data X = {xij}T×N, the feature vector of the index states is estimated based on CG-1, considering each index sample xij as a cloud drop. The evaluation process is as follows.

Thus, the current index states are denoted by IS = {is1, ..., isN}, where isj = (Exj, Enj, Hej).
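A sketch of this preprocessing step is given below. The geometric TIF weighting is an assumed instantiation of equation (12), which is not reproduced in the text; the weighted moment estimates mirror the CG-1 variant sketched in Section 2.2:

```python
import numpy as np

def index_states(X, tif=0.9):
    """Extract is_j = (Ex_j, En_j, He_j) per index from the T x N runtime
    matrix X, weighting fresher rows more (assumed geometric TIF decay)."""
    X = np.asarray(X, dtype=float)
    T, N = X.shape
    w = tif ** np.arange(T - 1, -1, -1)   # last (freshest) row gets weight 1
    w = w / w.sum()
    states = []
    for j in range(N):
        x = X[:, j]
        Ex = float(w @ x)                              # weighted mean
        En = float(np.sqrt(np.pi / 2) * (w @ np.abs(x - Ex)))
        var = float(w @ (x - Ex) ** 2)
        He = float(np.sqrt(max(var - En ** 2, 0.0)))   # weighted CG^-1
        states.append((Ex, En, He))
    return states
```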

4.2.2 Failure Probability Evaluation

In the failure probability evaluation, the model estimates the possibility of failure occurrence based on the trained EHMM-CT and the current index states IS = {is1, ..., isN} acquired in Section 4.2.1. The details are as follows.
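Because the detailed evaluation steps are not reproduced above, the following is a heavily hedged sketch of one plausible realization: each current index state is matched against its failure-prone state cloud via the cloud likelihood, and the matches are combined with the weights ω; the weighted-sum form and the threshold comparison are our assumptions:

```python
import numpy as np

def failure_probability(index_states, failure_prone_states, omega, likelihood):
    """Match each current index state against its failure-prone state cloud
    and combine the matches with the index weights omega; returns a value
    in [0, 1] to compare against the prediction threshold."""
    match = np.array([likelihood(s, f)
                      for s, f in zip(index_states, failure_prone_states)])
    return float(np.dot(omega, match))

# warn = failure_probability(IS, S_fp, omega, cloud_likelihood) > threshold
```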

 

5. Experiments and Analysis

5.1 Evaluation Environment

Online prediction of eventually upcoming failures is illustrated in Fig. 3. If a prediction is performed at time t, we would like to know whether or not a failure will occur during [t+Δtl, t+Δtl+Δtp], based on the trained prediction model and the data within the data window Δtd [15].

Fig. 3. Online failure prediction: t is the present time; Δtd represents the data window size; Δtl denotes the lead time; Δtw is the warning time; and Δtp symbolizes the prediction period.

The performance of the proposed approach is evaluated via MATLAB with runtime index data collected from our laboratory-wide cloud computing system. The system contains 27 high-performance, interconnected computing servers dedicated to computational research. The computing servers are virtualized into a cloud computing resource pool, and multiple virtual machines are created from the virtual resources to carry parallel or cooperative applications, such as distributed data mining algorithms and research simulations. In the simulations, performance indices of the cloud computing system, such as CPU utilization, memory utilization, and I/O rate, are extracted for the failure prediction. A failure is defined as the event in which a system ceases to fulfill its specification [15]. Fig. 4 presents the number of alarms per slot (a slot equals the measurement period of the system, i.e., 30 s) in part of the test data. As shown in the figure, the number of alarms per slot varies considerably, and there is a correlation between the occurrence of failures (represented by triangles in Fig. 4) and alarms. Fig. 5 is a histogram of the time between failures (TBF). It can be seen from the histogram that the distribution of failures is wide-spread, and no periodicity is evident in the failure occurrences. Fig. 4 and Fig. 5 show that, in the simulations, the causality between alarms and failures can be directly observed, and the distribution of failures is random and non-uniform.

Fig. 4. Number of alarms per slot in part of the test data. (Triangles represent the occurrences of failures.)

Fig. 5. Time between failures (TBF).

5.2 Performance

5.2.1 Feasibility

In this section, the execution scenario is designed and the feasibility of the EHMM-CT for failure prediction is verified. The steps of the simulation are as follows.

Fig. 6. Generation of state clouds.

Fig. 7. Distributions of the transition probability and observation probability.

Table 1. Probability distributions of the state matching and triggering a failure.

5.2.2 Effectiveness

For evaluating the performance of the proposed EHMM-CT, the metrics of [15], including precision, recall, and F-measure, are introduced. A perfect failure prediction achieves a one-to-one matching between predicted and actual failures, which results in precision = recall = 1 and fpr = 0 [8].
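These metrics follow directly from the confusion-matrix counts; the counts in the example below are purely illustrative:

```python
def prediction_metrics(tp, fp, fn, tn):
    """precision = TP/(TP+FP), recall = TP/(TP+FN), fpr = FP/(FP+TN),
    F-measure = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f_measure

print(prediction_metrics(tp=95, fp=19, fn=27, tn=500))  # illustrative counts
```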

Additionally, there is a trade-off between high recall and high precision: improving recall, in most cases, lowers precision, and vice versa. A similar trade-off holds between recall and fpr. The proposed EHMM-CT provides a customizable failure prediction threshold ϑ for classification; by varying the threshold, the trade-off, e.g., between precision and recall, can be controlled. Fig. 8 shows the precision/recall plot and the recall/fpr (so-called ROC) plot, which are obtained by varying parameters such as Δtd, Δtl, and Δtp.

Fig. 8. Failure prediction performance of the EHMM-CT approach. The different symbols correspond to different parameter settings.

Fig. 9 illustrates the accumulated runtime cost for the system. It shows a run of 18 slots containing 14 failures, denoted by black triangles. For comparison, two curves are added: “perfect prediction” and “no prediction”. The former refers to the case where a perfect failure predictor predicts all failures without any mis-prediction. The latter refers to a system with neither a predictor nor any reaction scheme in place; once a failure occurs, the execution cost increases by 5, which is the overhead for a false negative. The SEP approach proposed in [15] is used as a comparison.

Fig. 9. Accumulated cost for each technique. A cost of 1 has been assigned to true positives, 2 to false positives, and 5 to false negative predictions. Triangles represent failures that occurred in the data set.
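The cost model of Fig. 9 can be replayed in a few lines; the outcome sequence below is invented for illustration:

```python
import numpy as np

def accumulated_cost(outcomes):
    """Cumulative cost per slot under Fig. 9's cost model:
    TP = 1, FP = 2, FN (missed failure) = 5, TN = 0."""
    cost = {"TP": 1, "FP": 2, "FN": 5, "TN": 0}
    return np.cumsum([cost[o] for o in outcomes])

print(accumulated_cost(["TP", "TN", "FP", "TP", "FN", "TP"]))  # [1 1 3 4 9 10]
```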

As shown in Fig. 9, the cost curve of the proposed EHMM-CT is close to that of perfect prediction, and the approach predicts most of the failures (13 out of 14), which reduces the cost of handling failures. Over time, the accumulated execution cost of the proposed approach is lower than that of SEP [15], and the EHMM-CT shows an advantage in prediction accuracy.

Table 2 summarizes the performance compared with the HSMM in [8] and the FABM in [18], taking into account the impact of thresholds on the performance of EHMM-CT. The table shows the prediction performance under different threshold settings, which are based on the statistical characteristics of the prediction probabilities output by the EHMM-CT during training: the average (AVE), the minimum (MIN), and the median (MED). As shown in the table, EHMM-CT with any of these settings provides better prediction performance than HSMM. In particular, EHMM-CT with MED achieves the maximum F-measure of 0.8058, with precision = 0.8348 and recall = 0.7787. Note that the EHMM-CT allows customizable thresholds by which the trade-off between precision and recall can be adjusted to various requirements. For example, for systems where the processing cost of a failure is heavy, it is necessary to set a lower threshold to obtain a higher recall and prevent the omission of failures; thus the MIN setting is preferred. Conversely, for a system where failures occur frequently and a higher precision is required, the AVE setting is appropriate.

Table 2. Summary of prediction results.

The EHMM-CT can be trained without complex preprocessing of the training data, and the computation required for estimating the model parameters is significantly reduced compared with other prediction techniques. Fig. 10 plots the average fpr versus the average training time for EHMM-CT, the HSMM in [8], and the FABM in [18]. As shown in the figure, the FABM achieves the best average fpr (about 0.05) with the longest training time (about 9 slots). The HSMM shows the worst prediction performance in terms of average fpr but converges faster (about 7 slots) than the FABM. The EHMM-CT achieves the fastest convergence, obtaining an acceptable prediction result (an average fpr of less than 0.1) with the least training time (about 5 slots). This can be explained as follows: both the FABM and the HSMM rely on machine learning techniques for model training and pattern identification, and because each step of the long iterative process re-estimates all the parameters numerically, their training complexity is higher, incurring longer training times. Among the three prediction approaches, the EHMM-CT obtains an optimal trade-off between failure prediction performance and computing complexity: the training time is significantly reduced while an acceptable prediction performance is preserved.

Fig. 10. Fpr versus training time for each prediction approach.

 

6. Related Work

In recent years, failure prediction and anomaly diagnosis have received special attention from ISPs and the network research community as means to provide better network management and more resilient network systems [9]. In this paper, only those failure prediction and anomaly diagnosis approaches that are closest to the proposed solution and that mainly inspired it are discussed.

In [6], finite-state machines (FSMs) are used to model correlations between network alarm sequences that occur during and before failures. Specifically, a probabilistic finite-state machine model is built for a known network fault based on historical data collected during its occurrences.

In pattern matching approaches, online machine learning techniques are usually used to model the behavior of monitored data [5][6][14][9]. These approaches can be divided into symptom monitoring techniques for predicting failures (as the proposed EHMM-CT is) and error monitoring mechanisms. Baldoni et al. [12] proposed an online prediction approach for safety-critical systems in which only network traffic is monitored to perform failure prediction, and an HMM is exploited to create a state recognizer. Salfner et al. [8] presented an error monitoring failure prediction technique that uses a Hidden Semi-Markov Model (HSMM) to recognize error patterns that can lead to failures. Hoffmann et al. [6] proposed two error monitoring based approaches, one of which resorts to a Discrete Time Markov Model and the other of which employs function approximation.

As cloud computing systems evolve, most of the methods mentioned above require significant recalibration and retraining. By continuously tracking the behavior of the system, recalibration and failure detection become more automated, which provides vital support to autonomic fault management [7]. Therefore, building on these autonomic technologies, the EHMM-CT is proposed to profile the states of systems using historical runtime data. In the EHMM-CT, the system behavior is modeled as hidden states, and failure events are also considered as part of the symbol set. A new training algorithm is proposed in this paper for the EHMM-CT, and the trained model is then used to predict failures in cloud computing systems.

 

7. Conclusions

This work has presented EHMM-CT, an online failure prediction approach for cloud computing systems. EHMM-CT is a novel extension of HMMs that is appropriate for failure prediction in cloud computing systems. Approaches from CT are introduced into model training, which substantially reduces the amount and complexity of computation. EHMM-CT models system behavior using historical runtime data and estimates the probability of failure occurrence by system state matching. Its performance has been evaluated via extensive simulations, whose results show that the approach is effective for predicting failures in cloud computing systems. Moreover, an optimal tradeoff between failure prediction performance and computational complexity is provided, showing excellent performance regarding this tradeoff.

Many extensions and refinements of the approach are still possible, in particular a universal and adaptive threshold setting mechanism for cloud computing systems. Moreover, as a promising fault tolerance technique, resource reallocation exploiting the results of the failure prediction will be part of our future investigations.

References

  1. R. Jhawar and V. Piuri, Computer and Information Security Handbook, 2nd Edition, Elsevier, Waltham, 2013. Article (CrossRef Link)
  2. Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, and R. Sahoo, “BlueGene/L failure analysis and prediction models,” in Proc. of IEEE Conf. on Dependable Systems and Networks, pp. 425-434, June, 2006. Article (CrossRef Link)
  3. X. Wang, H. Sun, T. Deng, and J. Huai, “On the tradeoff of availability and consistency for quorum systems in data center networks,” Computer Networks, vol. 76, pp. 191-206, January, 2015. Article (CrossRef Link) https://doi.org/10.1016/j.comnet.2014.11.006
  4. C. Liu, D. Li, Y. Du, and X. Han, “Some statistical analysis of the normal cloud model,” Information and Control, vol. 34, no. 2, pp. 236-239+248, 2005. Article (CrossRef Link)
  5. S. Fu and C.-Z. Xu, “Quantifying temporal and spatial correlation of failure events for proactive management,” in Proc. of 26th IEEE Int. Symposium on Reliable Distributed Systems, pp. 175-184, October, 2007. Article (CrossRef Link)
  6. G. A. Hoffmann, F. Salfner, and M. Malek, “Advanced failure prediction in complex software systems,” Technical Report, 2004. Article (CrossRef Link)
  7. R. Chaparadza, N. Tcholtchev, and V. Kaldanis, “How Autonomic Fault-Management Can Address Current Challenges in Fault-Management Faced in IT and Telecommunication Networks,” Access Networks, vol. 63, pp. 253-268, 2011. Article (CrossRef Link)
  8. F. Salfner and M. Malek, “Using hidden semi-markov models for effective online failure prediction,” in Proc. of 26th IEEE Int. Symposium on Reliable Distributed Systems, pp. 161-174, October, 2007. Article (CrossRef Link)
  9. A. K. Marnerides, A. Schaeffer-Filho, and A. Mauthe, “Traffic anomaly diagnosis in Internet backbone networks: A survey,” Computer Networks, vol. 73, pp. 224-243, November, 2014. Article (CrossRef Link) https://doi.org/10.1016/j.comnet.2014.08.007
  10. F. Salfner, “Modeling event-driven time series with generalized hidden semi-Markov models,” Technical Report, 2006. Article (CrossRef Link)
  11. D. Li and C. Liu, “Study on the universality of the normal cloud model,” Engineering Science, vol. 6, no. 8, pp. 28-34, 2004. Article (CrossRef Link)
  12. R. Baldoni, L. Montanari, and M. Rizzuto, “On-line failure prediction in safety-critical systems,” Future Generation Computer Systems, vol. 45, pp. 123-132, April, 2015. Article (CrossRef Link) https://doi.org/10.1016/j.future.2014.11.015
  13. H. S. Huang and R. C. Wang, “Subjective trust evaluation model based on membership cloud theory,” Journal of Communication, vol. 29, no. 4, pp.13-19, 2008. Article (CrossRef Link)
  14. P. Casas, J. Mazel, and P. Owezarski, “UNADA: Unsupervised network anomaly detection using sub-space outliers ranking,” NETWORKING, vol. 6640, pp. 40-51, May, 2011. Article (CrossRef Link)
  15. F. Salfner, M. Schieschke, and M. Malek, “Predicting failures of computer systems: A case study for a telecommunication system,” in Proc. of 20th IEEE Int. Symposium in Parallel and Distributed Processing, April, 2006. Article (CrossRef Link)
  16. H. L. Li, C. H. Guo, and W. R. Qiu, “Similarity measurement between normal cloud models,” Acta Electronica Sinica, vol. 39, no. 11, pp. 2561-2567, 2011. Article (CrossRef Link)
  17. S. B. Zhang, C. X. Xu, and Y. J. An, “Study on the Risk Evaluation Approach Based on Cloud Model,” Chinese Journal of Computers, vol. 42, no. 1, pp. 92-68, 2013. Article (CrossRef Link)
  18. H. J. Abed, A. Al-Fuqaha, B. Khan, and A. Rayes, “Efficient failure prediction in autonomic networks based on trend and frequency analysis of anomalous patterns,” International Journal of Network Management, vol. 23, no. 3, pp. 186-213, 2013. Article (CrossRef Link) https://doi.org/10.1002/nem.1825