• Title/Summary/Keyword: Fault Tolerance System

A study on the Design Techniques and Analysis of Fault-Tolerant Computers

  • Cho, Jai-Rip
    • Journal of Korean Society for Quality Management / v.21 no.1 / pp.78-95 / 1993
  • The art of designing and analyzing fault-tolerant computers is surveyed with special emphasis on problems of analyzing the behavior of computers that have autonomous repair capability. The survey covers the following topics: (1) general issues in computer reliability, (2) fault-tolerance state relations and requirements, (3) computational hierarchy, (4) fault characteristics, (5) fault diagnosis, (6) fault-tolerance schemes for logic networks and machines, (7) fault-coverage effects, and (8) fault-tree analysis of coverage. This paper does not include techniques for verifying nonredundant hardware or system software designs, or for verifying the correctness of application programs.
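To make the "fault-coverage effects" topic concrete, the following is a standard textbook relation (not taken from this paper) for a duplex system of two identical modules, each with reliability R, in which a module failure is detected and handled correctly with coverage probability c:

```latex
R_{\text{sys}} = R^{2} + 2c\,R\,(1 - R)
```

The first term is the case where both modules work; the second covers exactly one module failing (two ways) with that failure successfully handled. With c = 1 this reduces to the ideal parallel result 2R - R^2, while with c = 0 the duplex drops to R^2, which is worse than a single module. This is why imperfect coverage can erase most of the benefit of redundancy.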

Multi-Agent System for Fault Tolerance in Wireless Sensor Networks

  • Lee, HwaMin;Min, Se Dong;Choi, Min-Hyung;Lee, DaeWon
    • KSII Transactions on Internet and Information Systems (TIIS) / v.10 no.3 / pp.1321-1332 / 2016
  • Wireless sensor networks (WSNs) are self-organized networks that typically consist of thousands of low-cost, low-power sensor nodes. The reliability and availability of WSNs can be affected by faults, including those caused by radio interference, battery exhaustion, hardware and software failures, communication link errors, and malicious attacks. We therefore propose a novel multi-agent fault-tolerant system for wireless sensor networks. Since a major requirement of WSNs is to reduce energy consumption, we use multi-agent and mobile agent configurations to manage WSNs so that they provide energy-efficient services. Mobile agent architectures have inherent advantages in that they provide energy awareness, scalability, reliability, and extensibility. Our multi-agent system consists of a resource manager, a fault tolerance manager, and a load balancing manager, and we also propose fault-tolerant protocols that use multi-agent and mobile agent setups.
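The interplay between a fault tolerance manager and a load balancing manager can be sketched as follows. This is a hypothetical illustration, not the authors' protocol: it assumes nodes report periodic heartbeats with their remaining energy, treats a node as faulty when it is silent past a timeout or its battery is exhausted, and hands the failed node's tasks to the healthy node with the most energy per assigned task. Mobile agents are not modeled; the class and field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class SensorNode:
    node_id: int
    energy: float              # remaining battery (arbitrary units)
    load: int = 0              # number of sensing tasks assigned
    last_heartbeat: float = 0.0


class FaultToleranceManager:
    """Hypothetical sketch: detect silent or drained nodes and reassign their tasks."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.nodes: dict[int, SensorNode] = {}

    def register(self, node: SensorNode) -> None:
        self.nodes[node.node_id] = node

    def heartbeat(self, node_id: int, now: float, energy: float) -> None:
        node = self.nodes[node_id]
        node.last_heartbeat = now
        node.energy = energy

    def failed_nodes(self, now: float) -> list[int]:
        # Suspect a node if it has been silent longer than the timeout
        # or its battery is exhausted.
        return [n.node_id for n in self.nodes.values()
                if now - n.last_heartbeat > self.timeout or n.energy <= 0]

    def recover(self, now: float) -> list[int]:
        failed = self.failed_nodes(now)
        healthy = [n for n in self.nodes.values() if n.node_id not in failed]
        if not healthy:
            return failed  # nothing to fail over to
        for node_id in failed:
            orphaned = self.nodes.pop(node_id).load
            # Load-balancing step: each orphaned task goes to the healthy node
            # with the most remaining energy per assigned task.
            for _ in range(orphaned):
                target = max(healthy, key=lambda n: n.energy / (n.load + 1))
                target.load += 1
        return failed
```

Weighting by energy per task is one simple way to reflect the energy-efficiency requirement stated in the abstract; a real deployment would also account for radio range and routing cost.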

Reliability Analysis for Train Control System by Software Fault Tolerance Techniques (소프트웨어 결함허용 기법에 의한 열차제어시스템 신뢰도 분석)

  • Suh, Seog-Chul;Lee, Jong-Woo
    • Journal of the Korean Society for Railway / v.12 no.6 / pp.1043-1048 / 2009
  • PES (Programmable Electronic Systems) are used to implement the software of train control systems. PES have been widely used in practice and consist of hardware, firmware, and application software. Because their implementation is highly flexible, PES can easily be applied to many applications, and in safety-critical systems many safety-critical functions are realized in software. Failures in a PES are usually difficult to detect, because the system is too complex for the sources of failure to be identified easily. Reliability analysis based on software fault tolerance techniques is therefore needed. Current software fault tolerance techniques include recovery blocks, distributed recovery blocks, N-version programming, and N self-checking programming. In this paper, Markov models of the recovery block and N-version programming techniques are proposed, and the reliability of the train control system is analyzed as a function of time. The fault occurrence rates of the programs, the acceptance test, and the voter are assumed to be constant, and the relationship between time and reliability is computed with MATLAB. The results show that, for the same number of alternate blocks, the recovery block achieves higher reliability than N-version programming.
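The comparison can be illustrated with a simplified combinatorial model rather than the authors' Markov model. The sketch assumes independent version failures, a constant failure rate per version (so each version's reliability is exp(-rate * t)), perfect rollback between alternates, and a single reliability figure for the acceptance test or voter. A recovery block succeeds if the acceptance test is correct and at least one alternate is correct; N-version programming succeeds if the voter is correct and a strict majority of versions is correct.

```python
import math
from math import comb


def version_reliability(rate: float, t: float) -> float:
    """Reliability of one software version with a constant failure rate."""
    return math.exp(-rate * t)


def recovery_block(n: int, r_version: float, r_test: float) -> float:
    """Acceptance test correct AND at least one of n alternates correct
    (simplified: independent failures, perfect rollback)."""
    return r_test * (1.0 - (1.0 - r_version) ** n)


def n_version(n: int, r_version: float, r_voter: float) -> float:
    """Voter correct AND a strict majority of the n versions correct."""
    need = n // 2 + 1
    majority = sum(comb(n, k) * r_version ** k * (1.0 - r_version) ** (n - k)
                   for k in range(need, n + 1))
    return r_voter * majority


if __name__ == "__main__":
    rate = 1e-3  # hypothetical failure rate per hour
    for t in (100, 500, 1000):
        r = version_reliability(rate, t)
        print(t, recovery_block(3, r, 0.99), n_version(3, r, 0.99))
```

For n = 3 with equally reliable adjudicators, the difference between the two "at least k correct" terms is 3r(1 - r)^2, which is positive for any 0 < r < 1. In this simplified model the recovery block therefore comes out ahead, which matches the direction of the result reported in the abstract.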

Simulation-Based Fault Analysis for Resilient System-On-Chip Design

  • Han, Chang Yeop;Jeong, Yeong Seob;Lee, Seung Eun
    • Journal of information and communication convergence engineering / v.19 no.3 / pp.175-179 / 2021
  • Enhancing system reliability is important in recent system-on-chip (SoC) designs, which has led to studies on fault diagnosis and fault tolerance. Fault-injection (FI) techniques are widely used to measure the fault-tolerance capabilities of resilient systems. However, FI techniques are limited by environmental conditions and system features. Moreover, hardware-based FI can cause permanent damage to the target system, because the actual circuit cannot be restored. Accordingly, we propose a simulation-based FI framework based on the Verilog Procedural Interface for measuring the failure rates of SoCs caused by soft errors. We execute five benchmark programs on an ARM Cortex-M0 processor and inject soft errors using the proposed framework. The experiment achieves a 95% confidence level with a ±2.53% error margin, confirming the reliability and feasibility of the proposed framework for fault analysis in SoCs.
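The statistical side of a fault-injection campaign can be sketched in software. This is not the authors' Verilog Procedural Interface framework: the "benchmark" here is a hypothetical toy that counts words above a threshold, and each injection flips one random bit of one random data word (a single-event upset model). A run counts as a failure if its output differs from the fault-free ("golden") output, and the margin of error uses the usual normal approximation at 95% confidence (z = 1.96), the kind of figure the abstract reports as ±2.53%.

```python
import math
import random


def benchmark(words: list[int], threshold: int = 1 << 20) -> int:
    """Hypothetical toy workload: count words above a threshold.
    Bit flips that do not cross the threshold are naturally masked."""
    return sum(1 for w in words if w > threshold)


def inject_bit_flip(words: list[int], rng: random.Random) -> list[int]:
    """Single-event upset model: flip one random bit of one random 32-bit word."""
    faulty = list(words)
    i = rng.randrange(len(faulty))
    faulty[i] ^= 1 << rng.randrange(32)
    return faulty


def fault_injection_campaign(words: list[int], n_injections: int, seed: int = 0):
    rng = random.Random(seed)
    golden = benchmark(words)
    failures = sum(1 for _ in range(n_injections)
                   if benchmark(inject_bit_flip(words, rng)) != golden)
    p = failures / n_injections
    # 95% confidence half-width under the normal approximation.
    margin = 1.96 * math.sqrt(p * (1.0 - p) / n_injections)
    return p, margin
```

For an all-zero input, only flips of bits 21 to 31 push a word over the threshold, so the true failure rate is 11/32. The estimate from a few thousand injections should land close to that value, with a margin of roughly ±1.5%. Increasing `n_injections` shrinks the margin in proportion to 1/sqrt(n).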

Development of a Fault-tolerant IoT System Based on the EVENODD Method (EVENODD 방법 기반 결함허용 사물인터넷 시스템 개발)

  • Woo, Min-Woo;Park, KeeHyun;An, Donghyeok
    • Asia-pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology / v.7 no.3 / pp.263-272 / 2017
  • The Internet of Things (IoT) has become increasingly popular, and its areas of application have broadened. However, if data stored in an IoT system is damaged and cannot be recovered, society could suffer considerable damage and disruption. So far, most studies on fault tolerance have focused on computer systems, and there has been little research on fault tolerance for IoT systems. This study therefore proposes a fault-tolerance method for IoT environments. Specifically, based on EVENODD, a traditional fault-tolerance method, a fault-tolerant storage and recovery method for data stored on the IoT server is proposed and implemented on a oneM2M IoT system. The proposed method consists of two phases: fault-tolerant data storage and fault-tolerant recovery. In the storage phase, some F-T gateways are designated, and fault-tolerant data are distributed across the F-T gateways' storage using the EVENODD method. In the recovery phase, the IoT server initiates the recovery procedure after it receives fault-tolerant data from non-faulty F-T gateways; that is, an EVENODD array is reconstructed and the received data are merged to obtain the original data.
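The EVENODD idea can be sketched with the classic construction: for a prime p, data is laid out as p columns of p - 1 symbols, plus one horizontal-parity column and one diagonal-parity column, and an imaginary all-zero row p - 1 is used when forming diagonals. In the paper's setting each column could plausibly live on a different F-T gateway, but that mapping is an assumption here. The sketch encodes an array and recovers the case where two data columns are lost. It first rebuilds the adjuster S from the parity columns, which is valid because p is odd, and then repeatedly solves any row or diagonal equation that has exactly one unknown symbol. Cases involving lost parity columns are simpler and are omitted.

```python
def encode(data: list[list[int]], p: int) -> list[list[int]]:
    """data: p columns of p-1 symbols each. Returns p+2 columns (data, row parity, diagonal parity)."""
    rows = p - 1

    def cell(r: int, c: int) -> int:
        return 0 if r == p - 1 else data[c][r]  # imaginary zero row

    horiz = [0] * rows
    for l in range(rows):
        for t in range(p):
            horiz[l] ^= data[t][l]
    s = 0  # adjuster: XOR along the diagonal that has no parity symbol
    for t in range(1, p):
        s ^= cell(p - 1 - t, t)
    diag = [s] * rows
    for l in range(rows):
        for t in range(p):
            diag[l] ^= cell((l - t) % p, t)
    return [list(col) for col in data] + [horiz, diag]


def recover_two_data_columns(cols: list[list], p: int, i: int, j: int) -> list[list]:
    """Rebuild erased data columns i and j in place by peeling one-unknown equations."""
    rows = p - 1
    s = 0
    for l in range(rows):  # XOR of all parity symbols equals S when p is odd
        s ^= cols[p][l] ^ cols[p + 1][l]
    known: dict[tuple[int, int], int] = {}

    def value(r: int, c: int):
        if r == p - 1:
            return 0
        if c in (i, j):
            return known.get((r, c))  # None while still unknown
        return cols[c][r]

    eqs = [([(l, t) for t in range(p)], cols[p][l]) for l in range(rows)]
    for l in range(p):  # diagonal l XORs to S ^ diag parity (diagonal p-1 XORs to S)
        rhs = s ^ (cols[p + 1][l] if l < rows else 0)
        eqs.append(([((l - t) % p, t) for t in range(p)], rhs))

    progress = True
    while len(known) < 2 * rows and progress:
        progress = False
        for cells, rhs in eqs:
            unknown = [rc for rc in cells if value(*rc) is None]
            if len(unknown) == 1:
                acc = rhs
                for rc in cells:
                    if rc != unknown[0]:
                        acc ^= value(*rc)
                known[unknown[0]] = acc
                progress = True
    if len(known) < 2 * rows:
        raise ValueError("recovery failed")
    for (r, c), v in known.items():
        cols[c][r] = v
    return cols
```

Each recovery step needs only XORs over symbols that survived on other columns. This is what makes EVENODD attractive for low-power gateways compared with codes that require finite-field multiplication.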

Fault-Free Process for IT System with TRM(Technical Reference Model) based Fault Check Point and Event Rule Engine (기술분류체계 기반의 장애 점검포인트와 이벤트 룰엔진을 적용한 무장애체계 구현)

  • Hyun, Byeong-Tag;Kim, Tae-Woo;Um, Chang-Sup;Seo, Jong-Hyen
    • Information Systems Review / v.12 no.3 / pp.1-17 / 2010
  • IT systems based on a Global Single Instance (GSI) can manage a corporation's internal information, resources, and assets effectively and raise business efficiency by consolidating business processes and improving productivity. However, they also carry a risk: a fault in the IT system can paralyze the business itself, leading to huge financial losses. Many studies have addressed fault tolerance based on redundant components. Although the concept of fault tolerance is rather simple, designing and adopting a fault-tolerant system is not easy because the types and frequencies of faults are uncertain. Operational fault management, performed after an IT system has been developed, is therefore becoming increasingly important alongside technical fault management. This study proposes a fault management process that includes a pre-estimation method using TRM (Technical Reference Model) check points and an event rule engine. It also evaluates the effect of this fault-free process by building a fault management system at a representative company in the high-tech industry. After the fault-free process was adopted, the number of failures decreased by 46%, failure time decreased by 56%, and opportunity loss costs decreased by 77%.
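The pairing of TRM check points with an event rule engine can be pictured as rules that tie a technology category and a metric threshold to an action. The sketch below is entirely hypothetical. The category labels, metric names, thresholds, and actions are invented for illustration, and the paper's actual rule format is not described in the abstract. Each monitoring event is matched against every rule, and the resulting actions are returned in order.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CheckPoint:
    trm_category: str   # hypothetical TRM label, e.g. "Server/OS" or "DBMS"
    metric: str
    threshold: float


@dataclass(frozen=True)
class Event:
    host: str
    trm_category: str
    metric: str
    value: float


@dataclass
class Rule:
    name: str
    check: CheckPoint
    action: Callable[[Event], str]

    def matches(self, event: Event) -> bool:
        # A rule fires when the event belongs to the check point's category/metric
        # and reaches its threshold, i.e. before the condition becomes an outage.
        return (event.trm_category == self.check.trm_category
                and event.metric == self.check.metric
                and event.value >= self.check.threshold)


class EventRuleEngine:
    def __init__(self, rules: list[Rule]):
        self.rules = list(rules)

    def process(self, events: list[Event]) -> list[str]:
        actions = []
        for event in events:
            for rule in self.rules:
                if rule.matches(event):
                    actions.append(rule.action(event))
        return actions
```

Organizing check points by TRM category keeps rule coverage auditable: it is easy to list which technology areas have no rules yet, which is one plausible reading of how a check-point catalog supports pre-estimation of faults.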

Comparative Study of the System Operational Method for Fault-Tolerance (Fault-Tolerance를 위한 시스템의 동작방식에 대한 비교 연구)

  • 양성현;이기서
    • The Journal of Korean Institute of Communications and Information Sciences / v.17 no.11 / pp.1279-1289 / 1992
  • Fault-tolerant systems improve reliability and safety by using hardware and software redundancy. Fault masking, detection, and identification techniques are applied selectively depending on the system's application area. Here, a DMR (dual modular redundancy) system with minimal hardware and software redundancy is operated under a standby method and under a fail-safe module method, and the reliability and safety of each are compared. This paper also proposes an effective method for dealing with transient faults, comparing the systems' MTTFs when the transient-fault tolerance provided by a self-diagnosis program is taken into account.
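The reliability versus safety trade-off between the two DMR operating modes can be shown with standard textbook models, which are not taken from the paper. The sketch assumes identical modules with constant failure rate `lam`. In cold standby, a spare takes over when the primary fails, and the switchover succeeds with probability `c`. In fail-safe mode, both modules must run and agree, and any disagreement is detected with coverage `c` and drives the system to a safe shutdown. Transient faults and the self-diagnosis program are not modeled.

```python
import math


def cold_standby_reliability(lam: float, t: float, c: float) -> float:
    """Primary plus unpowered spare: R(t) = e^(-lam t) * (1 + c lam t)."""
    return math.exp(-lam * t) * (1.0 + c * lam * t)


def cold_standby_mttf(lam: float, c: float) -> float:
    """Integral of R(t) from 0 to infinity: (1 + c) / lam."""
    return (1.0 + c) / lam


def fail_safe_reliability(lam: float, t: float) -> float:
    """Both modules must work, so any single failure ends the mission: e^(-2 lam t)."""
    return math.exp(-2.0 * lam * t)


def fail_safe_safety(lam: float, t: float, c: float) -> float:
    """Safe if still operating, or if the failure was detected and the system shut down safely."""
    r = fail_safe_reliability(lam, t)
    return r + c * (1.0 - r)


def fail_safe_mttf(lam: float) -> float:
    return 1.0 / (2.0 * lam)
```

With `lam = 1e-3` per hour and perfect switching, standby gives an MTTF of 2000 h against 500 h for the fail-safe pair. The fail-safe pair instead keeps its safety near `c` even after it has stopped operating, which illustrates why the two modes are compared on reliability and safety separately.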

Service Deployment Strategy for Customer Experience and Cost Optimization under Hybrid Network Computing Environment

  • Ning Wang;Huiqing Wang;Xiaoting Wang
    • KSII Transactions on Internet and Information Systems (TIIS) / v.17 no.11 / pp.3030-3049 / 2023
  • With the development and wide application of hybrid network computing modes such as cloud computing, edge computing, and fog computing, handling customer service requests and jointly optimizing diverse computing resources have become major challenges. Given the characteristics of network environment resources, optimizing the deployment of service resources is a feasible solution. In this paper, the optimization goals for deploying service resources are customer experience and service cost, with a focus on how service deployment affects system load, fault tolerance, service cost, and quality of service (QoS). To this end, an alternate node filtering algorithm (ANF) and an adjustment factor for the cost matrix are proposed to enhance system service performance without changing the minimum total service cost, and a corresponding theoretical proof is provided. In addition, to improve the fault tolerance of the system, an alternate node preference factor and algorithm (ANP) are presented, which effectively reduce the probability of losing data copies; building on these, an improved cost-efficient replica deployment strategy named ICERD is proposed. Finally, experiments that simulate random cloud node failures and compare ICERD with representative strategies show that ICERD not only reduces customer access latency, meets customers' QoS requests, and improves system service quality, but also maintains load balancing across the entire system, reduces service cost, and enhances fault tolerance, which further confirms its effectiveness and reliability.
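The tension between replica cost and the probability of losing every copy can be illustrated with a small brute-force placement. This is not ANF, ANP, or ICERD, whose details are not given in the abstract. The sketch assumes each node has a placement cost, an independent failure probability, and a failure domain such as a rack or site (all hypothetical attributes). It returns the cheapest set of k nodes in distinct domains whose probability of losing all replicas stays within a bound.

```python
from dataclasses import dataclass
from itertools import combinations
from math import prod


@dataclass(frozen=True)
class Node:
    name: str
    domain: str        # hypothetical failure domain (rack, site, ...)
    cost: float        # storage + access cost of hosting a replica here
    fail_prob: float   # probability the node fails in the period considered


def loss_probability(nodes) -> float:
    """All copies are lost only if every hosting node fails (independence assumed)."""
    return prod(n.fail_prob for n in nodes)


def place_replicas(nodes, k: int, max_loss: float):
    """Cheapest k-node placement in distinct failure domains with loss probability <= max_loss.
    Exhaustive search over combinations, so only suitable for small clusters."""
    best = None
    for combo in combinations(nodes, k):
        if len({n.domain for n in combo}) < k:
            continue  # two copies in one domain could fail together
        if loss_probability(combo) > max_loss:
            continue
        total = sum(n.cost for n in combo)
        if best is None or total < best[0]:
            best = (total, combo)
    return None if best is None else best[1]
```

Tightening `max_loss` pushes replicas onto more reliable, and usually more expensive, nodes. This is the cost versus fault-tolerance trade-off that the abstract describes handling without raising the minimum total service cost.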

Practical Swarm Optimization based Fault-Tolerance Algorithm for the Internet of Things

  • Luo, Shiliang;Cheng, Lianglun;Ren, Bin
    • KSII Transactions on Internet and Information Systems (TIIS) / v.8 no.3 / pp.735-748 / 2014
  • The fault-tolerant routing problem is one of the most important issues in Internet of Things applications and has attracted growing research interest. To maintain the communication paths from source sensors to the macronodes, we present a hybrid routing scheme and model in which alternate paths are created once the previous route is broken. We then propose an improved efficient and intelligent fault-tolerance algorithm (IEIFTA) that provides fast routing recovery and reconstructs the network topology after path failures in the Internet of Things. In IEIFTA, the mutation direction of each particle is determined by a multi-swarm evolution equation, and particle diversity is improved by an immune mechanism, which improves both the global search ability and the convergence rate of the algorithm. Simulation results indicate that the IEIFTA-based fault-tolerance algorithm outperforms the EARQ and SPSOA algorithms, owing to its fast routing recovery mechanism and its ability to prolong the lifetime of the Internet of Things.
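The optimizer family behind IEIFTA can be sketched as a generic multi-swarm particle swarm optimization. This is an illustration under stated assumptions, not the paper's algorithm. The paper's evolution equation and immune mechanism are not given in the abstract, so the velocity update simply pulls each particle toward its personal best, its swarm's best, and the global best. The "immune" step is a hypothetical stand-in that periodically re-seeds each swarm's worst particle to restore diversity. The objective is a toy sphere function standing in for a route-cost function.

```python
import random


def sphere(x):
    """Toy objective with minimum 0 at the origin (stand-in for a route cost)."""
    return sum(v * v for v in x)


def multi_swarm_pso(f, dim, n_swarms=3, swarm_size=10, iters=200, bounds=(-5.0, 5.0),
                    w=0.7, c1=1.5, c2=1.5, reseed_every=10, seed=0):
    rng = random.Random(seed)
    lo, hi = bounds

    def random_position():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    swarms = []
    for _ in range(n_swarms):
        pos = [random_position() for _ in range(swarm_size)]
        vel = [[0.0] * dim for _ in range(swarm_size)]
        pbest = [p[:] for p in pos]
        swarms.append((pos, vel, pbest))
    gbest = min((p for _, _, pb in swarms for p in pb), key=f)[:]

    for it in range(1, iters + 1):
        for pos, vel, pbest in swarms:
            sbest = min(pbest, key=f)[:]  # this swarm's best position
            for i in range(swarm_size):
                for d in range(dim):
                    r1, r2, r3 = rng.random(), rng.random(), rng.random()
                    # Pull toward personal, swarm, and global bests (multi-swarm coupling).
                    vel[i][d] = (w * vel[i][d]
                                 + c1 * r1 * (pbest[i][d] - pos[i][d])
                                 + 0.5 * c2 * r2 * (sbest[d] - pos[i][d])
                                 + 0.5 * c2 * r3 * (gbest[d] - pos[i][d]))
                    pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
                if f(pos[i]) < f(pbest[i]):
                    pbest[i] = pos[i][:]
                    if f(pbest[i]) < f(gbest):
                        gbest = pbest[i][:]
            if it % reseed_every == 0:
                # Hypothetical immune-style diversity step: re-seed the worst particle.
                worst = max(range(swarm_size), key=lambda j: f(pbest[j]))
                pos[worst] = random_position()
                vel[worst] = [0.0] * dim
                pbest[worst] = pos[worst][:]
    return gbest, f(gbest)
```

In a routing setting, the position vector would encode a candidate alternate path (for example, per-hop preferences), and `f` would score its energy cost and hop count. How IEIFTA encodes paths is not stated in the abstract.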

Practical Swarm Optimization based Fault-Tolerance Algorithm for the Internet of Things

  • Luo, Shiliang;Cheng, Lianglun;Ren, Bin
    • KSII Transactions on Internet and Information Systems (TIIS) / v.8 no.4 / pp.1178-1191 / 2014
  • The fault-tolerant routing problem is one of the most important issues in Internet of Things applications and has attracted growing research interest. To maintain the communication paths from source sensors to the macronodes, we present a hybrid routing scheme and model in which alternate paths are created once the previous route is broken. We then propose an improved efficient and intelligent fault-tolerance algorithm (IEIFTA) that provides fast routing recovery and reconstructs the network topology after path failures in the Internet of Things. In IEIFTA, the mutation direction of each particle is determined by a multi-swarm evolution equation, and particle diversity is improved by an immune mechanism, which improves both the global search ability and the convergence rate of the algorithm. Simulation results indicate that the IEIFTA-based fault-tolerance algorithm outperforms the EARQ and SPSOA algorithms, owing to its fast routing recovery mechanism and its ability to prolong the lifetime of the Internet of Things.