http://dx.doi.org/10.3837/tiis.2016.09.004

EHMM-CT: An Online Method for Failure Prediction in Cloud Computing Systems  

Zheng, Weiwei (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications)
Wang, Zhili (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications)
Huang, Haoqiu (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications)
Meng, Luoming (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications)
Qiu, Xuesong (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS) / v.10, no.9, 2016, pp. 4087-4107
Abstract
The current cloud computing paradigm remains vulnerable to a significant number of system failures. Growing demand for fault tolerance and resilience that is both cost-effective and device-independent motivates the search for an effective way to address system dependability and availability. This paper focuses on online failure prediction for cloud computing systems using system runtime data, in contrast to traditional fault tolerance techniques, which require in-depth knowledge of the underlying mechanisms. A failure prediction approach based on Cloud Theory (CT) and the Hidden Markov Model (HMM) is proposed, which extends the HMM by training it with CT. To take multiple runtime indices of a cloud computing system into account, the approach defines a parameter ω that captures the correlations between those indices and failures. The approach also extends the parameters of the HMM so that failure prediction can be described in detail along multiple dimensions. The likelihood and membership degree computing algorithms of CT replace the traditional HMM algorithms in order to reduce computing overhead in the model training phase. Simulation results show that the proposed approach predicts failures accurately at low computational cost and achieves an optimal tradeoff between failure prediction performance and computing overhead.
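As a rough illustration of the membership degree computation that the abstract attributes to CT, the sketch below evaluates the standard normal cloud model, in which a concept is described by an expectation Ex, an entropy En, and a hyper-entropy He. It is not the EHMM-CT algorithm itself, which the abstract does not specify. The function names, the CPU-load index, and all numeric values are assumptions chosen only for illustration.

```python
import math
import random


def membership_degree(x, ex, en, he, rng):
    """Certainty degree of one observation x under a normal cloud (Ex, En, He).

    Standard normal cloud model: draw En' ~ N(En, He^2), then
    mu = exp(-(x - Ex)^2 / (2 * En'^2)).
    """
    en_prime = rng.gauss(en, he)  # randomised entropy for this cloud drop
    if en_prime == 0.0:
        # Degenerate drop: only the expectation itself belongs to the concept.
        return 1.0 if x == ex else 0.0
    return math.exp(-((x - ex) ** 2) / (2.0 * en_prime ** 2))


def expected_membership(x, ex, en):
    """Membership degree when He = 0, i.e. the cloud's expectation curve."""
    return math.exp(-((x - ex) ** 2) / (2.0 * en ** 2))


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed so runs are repeatable
    # Hypothetical cloud for a "CPU load near failure" concept (assumed values).
    ex, en, he = 0.90, 0.05, 0.005
    for load in (0.70, 0.85, 0.90):
        mu = membership_degree(load, ex, en, he, rng)
        print(f"load={load:.2f} membership={mu:.4f}")
```

Each evaluation costs one Gaussian draw and one exponential, which suggests why a membership-based score could be cheaper than iterative likelihood re-estimation; the paper's actual cost comparison is not reproduced here.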
Keywords
Online failure prediction; cloud theory; hidden Markov model; cloud computing systems