• Title/Summary/Keyword: fault tolerant computing


A Fault-Tolerant Linear System Solver in a Standard MPI Environment (표준 MPI 환경에서의 무정지형 선형 시스템 해법)

  • Park, Pil-Seong
    • Journal of Internet Computing and Services, v.6 no.6, pp.23-34, 2005
  • In large-scale parallel computation, failures of some nodes or communication links waste computing resources. Several fault-tolerant MPI libraries have been proposed so far, but programs written with such libraries have a portability problem, since fault-tolerant features are not yet supported by the MPI standard. In this paper, we propose an application-level fault-tolerant linear system solver that uses only the asynchronous iteration algorithm and standard MPI functions; it avoids the portability problem and is more efficient because it adopts a simplified recovery mechanism.

  • PDF
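
The abstract above does not give the algorithm itself, so the following is only a minimal sketch of the general idea behind asynchronous iteration: a Jacobi-style solver keeps converging even when some components are not updated for a while. It assumes a strictly diagonally dominant matrix and simulates workers in a single process rather than with MPI. The names `async_jacobi`, `owners`, and `down` are hypothetical and are not taken from the paper.

```python
def async_jacobi(A, b, owners, down, sweeps=200):
    """Jacobi-style iteration in which some workers skip sweeps.

    A, b   : dense linear system (lists); A is assumed strictly diagonally dominant
    owners : owners[i] is the (simulated) worker that updates component i
    down   : down(worker, sweep) -> True if that worker is unavailable in that sweep
    """
    n = len(b)
    x = [0.0] * n
    for sweep in range(sweeps):
        old = x[:]  # values visible to every worker at the start of the sweep
        for i in range(n):
            if down(owners[i], sweep):
                continue  # component keeps its stale value during the outage
            s = sum(A[i][j] * old[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x


# Example system whose exact solution is [1, 2, 3, 4].
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [6.0, 12.0, 18.0, 19.0]
owners = [0, 0, 1, 1]  # worker 0 owns x[0], x[1]; worker 1 owns x[2], x[3]

# Assumption for illustration: worker 1 is unavailable during sweeps 10..29.
x = async_jacobi(A, b, owners, lambda w, s: w == 1 and 10 <= s < 30)
```

In this sketch, recovery is simply a matter of the returning worker resuming from whatever values are currently available; no checkpoint or rollback is involved. Whether this matches the paper's simplified recovery mechanism is not stated in the abstract.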

Design and Cost Analysis for a Fault-Tolerant Distributed Shared Memory System

  • Jazi, AL-Harbi Fahad;Kim, Kangseok;Kim, Jai-Hoon
    • Journal of Internet Computing and Services, v.17 no.4, pp.1-9, 2016
  • Algorithms implementing distributed shared memory (DSM) were developed to ensure consistency. The performance of DSM algorithms depends on system and usage parameters. However, making these algorithms tolerate faults is a problem that still needs to be researched. In this study, we propose a fault-tolerant scheme for DSM systems and analyze its reliability and fault-tolerance overhead. Using our analysis, a suitable DSM algorithm can be chosen for an error-prone environment.

A Fault-tolerant Inertial Navigation System for UAVs Based on Partition Computing (파티션 컴퓨팅 기반의 무인기 고장 감내 관성 항법 시스템)

  • Jung, Byeongyong;Kim, Jungguk
    • KIISE Transactions on Computing Practices, v.21 no.1, pp.29-39, 2015
  • When new inertial navigation systems for unmanned aerial vehicles (UAVs) are developed and tested, a fault-tolerant system is required because of the various hazards caused by S/W and H/W faults. In this paper, a new fault-tolerant flight system that can be deployed on one or more FCCs (Flight Control Computers) is introduced, based on a partition scheme in which each OFP (Operational Flight Program) partition uses an independent CPU and memory slot. The new fault-tolerant navigation system uses one or two FCCs and executes, on each node, a primary navigation OFP under development and a stable shadow OFP partition. The fault-tolerant navigation system based on a single FCC can be used for UAVs with small payloads. For larger UAVs, an additional FCC with two OFP partitions can be used to provide both H/W and S/W fault tolerance. The developed fault-tolerant navigation system significantly reduces the hazards involved in testing new navigation S/W for UAVs.

A Fault Tolerant Data Management Scheme for Healthcare Internet of Things in Fog Computing

  • Saeed, Waqar;Ahmad, Zulfiqar;Jehangiri, Ali Imran;Mohamed, Nader;Umar, Arif Iqbal;Ahmad, Jamil
    • KSII Transactions on Internet and Information Systems (TIIS), v.15 no.1, pp.35-57, 2021
  • Fog computing aims to address the bandwidth, network latency, and energy consumption problems of cloud computing. Managing the data generated by healthcare IoT devices is one of its significant applications. Healthcare IoT devices generate huge amounts of data, which must be managed efficiently, with low latency, without failure, and with minimal energy consumption and cost. Task or node failures can increase latency, energy consumption, and cost. Thus, a failure-free, cost-efficient, and energy-aware management and scheduling scheme for data generated by healthcare IoT devices not only improves system performance but can also save patients' lives, owing to minimal latency and the provision of fault tolerance. To address these challenges in data management and fault tolerance, we present a Fault Tolerant Data Management (FTDM) scheme for healthcare IoT in fog computing. In FTDM, the data generated by healthcare IoT devices is organized and managed through well-defined components and steps. FTDM provides a two-way fault-tolerant mechanism, i.e., task-based and node-based fault tolerance, through which failures of tasks and nodes are managed. The paper considers energy consumption, execution cost, network usage, latency, and execution time as performance evaluation parameters. Simulations performed using iFogSim show significant improvements: the proposed FTDM strategy reduces energy consumption by 3.97%, execution cost by 5.09%, network usage by 25.88%, latency by 44.15%, and execution time by 48.89% compared with the existing Greedy Knapsack Scheduling (GKS) strategy. Moreover, patients sometimes need to be treated remotely because facilities are unavailable or because of infectious diseases such as COVID-19; in such circumstances, the proposed strategy is particularly efficient.

Dynamic Redundancy-based Fault-Recovery Scheme for Reliable CGRA-based Multi-Core Architecture

  • Kim, Yoonjin;Sohn, Seungyeon
    • JSTS:Journal of Semiconductor Technology and Science, v.15 no.6, pp.615-628, 2015
  • A multi-core architecture based on CGRA (Coarse-Grained Reconfigurable Architecture) can be considered a suitable solution for fault-tolerant computing. However, the few existing research projects on fault-tolerant CGRA do not exploit the strengths of CGRA, and they are limited to a single CGRA. Therefore, in this paper, we propose two approaches that exploit the inherent redundancy and reconfigurability of a multi-CGRA for fault recovery. One is a resilient inter-CGRA fabric, a ring-based sharing fabric (RSF) with minimal interconnection overhead. The other is a novel intra/inter-CGRA reconfiguration technique on the RSF that maximizes resource utilization when faults occur. Experimental results show that the proposed approaches achieve up to 94% fault recoverability while reducing area/delay/power by up to 15%/28.6%/31% compared with a completely connected fabric (CCF).

Technology Trends of Fault-tolerant Quantum Computing (결함허용 양자컴퓨팅 시스템 기술 연구개발 동향)

  • Hwang, Y.;Kim, T.W.;Baek, C.H.;Cho, S.U.;Kim, H.S.;Choi, B.S.
    • Electronics and Telecommunications Trends, v.37 no.2, pp.1-10, 2022
  • Similar to present-day computers, quantum computers comprise quantum bits (qubits) and an operating system. However, because quantum states are fragile, quantum errors must be corrected using entangled physical qubits with quantum error correction (QEC) codes. The combination of entangled physical qubits with a QEC protocol is called a logical qubit, and its computational model is called fault-tolerant quantum computation. Thus, QEC is the heart of fault-tolerant quantum computing and overcomes the limitations of noisy intermediate-scale quantum computing. In this study, we briefly survey the status of QEC codes and the physical implementation of logical qubits over various qubit technologies. In summary, we emphasize that 1) the error threshold of a quantum system depends on its configuration and 2) therefore, we cannot settle on any single theoretical and/or physical experimental proposal.

Optimal Fault-Tolerant Resource Placement in Parallel and Distributed Systems (병렬 및 분산 시스템에서의 최적 고장 허용 자원 배치)

  • Kim, Jong-Hoon;Lee, Cheol-Hoon
    • Journal of KIISE:Computer Systems and Theory, v.27 no.6, pp.608-618, 2000
  • We consider the problem of placing resources in a distributed computing system so that certain performance requirements are met with the minimum number of resource copies, irrespective of node or link failures. To meet the requirements for high performance and high availability, a minimum number of resource copies should be placed in such a way that each node has at least two copies on the node itself or on its neighbor nodes. This is called the fault-tolerant resource placement problem in this paper. The structure of a parallel or distributed computing system is represented by a graph. The fault-tolerant placement problem is first transformed into the problem of finding the smallest fault-tolerant dominating set in a graph. The dominating set problem is known to be NP-complete. In this paper, the search for the smallest fault-tolerant dominating set is formulated as a state-space search problem, which is then solved optimally with the well-known A* algorithm. To speed up the search, we derive heuristic information by analyzing the properties of fault-tolerant dominating sets. Experimental results on various regular and random graphs show that the search time can be reduced dramatically using the heuristic information.

  • PDF
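
The placement condition stated in the abstract above (every node has at least two copies on itself or its neighbors) can be illustrated with a small sketch. Note that this uses exhaustive search over subsets, not the A* formulation or the heuristics from the paper, so it is only practical for tiny graphs. The function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
from itertools import combinations


def is_ft_dominating(adj, placed, copies=2):
    """True if every node sees at least `copies` resource copies
    in its closed neighborhood (the node itself plus its neighbors)."""
    placed = set(placed)
    return all(len(placed & (adj[v] | {v})) >= copies for v in adj)


def smallest_ft_dominating_set(adj, copies=2):
    """Brute-force search by increasing set size; exponential in the worst case.
    Returns None if no placement satisfies the condition."""
    nodes = sorted(adj)
    for k in range(len(nodes) + 1):
        for cand in combinations(nodes, k):
            if is_ft_dominating(adj, cand, copies):
                return set(cand)
    return None


# Example graph (an assumption for illustration): a ring of six nodes.
ring = {v: {(v - 1) % 6, (v + 1) % 6} for v in range(6)}
placement = smallest_ft_dominating_set(ring)
```

On the six-node ring, each closed neighborhood contains three nodes, so a counting argument gives a lower bound of four copies, and the search finds a placement of exactly that size.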

The Performance Comparison of Low-Overhead Fault Tolerant Services based on Distributed Object (분산객체 기반 경량화 결함허용 기술의 성능 비교)

  • Kim, Shik;Hyun, Mu-Yong
    • The Journal of Information Technology, v.9 no.4, pp.25-34, 2006
  • As application programs have become more sophisticated and have adopted distributed object technology, object-based distributed design has become widespread because it supports portability and reusability. When fault-tolerant facilities are added on, approaches to fault-tolerant distributed computing are categorized into the active replica mechanism for mission-critical application programs and the passive replica mechanism for non-mission-critical ones. Our paper surveys the advantages and drawbacks of several approaches to add-on low-overhead fault-tolerant services and presents experimental results on benchmark models to demonstrate their performance.

  • PDF

Hierarchical Multiplexing Interconnection Structure for Fault-Tolerant Reconfigurable Chip Multiprocessor

  • Kim, Yoon-Jin
    • JSTS:Journal of Semiconductor Technology and Science, v.11 no.4, pp.318-328, 2011
  • A stage-level reconfigurable chip multiprocessor (CMP) aims to achieve highly reliable, fault-tolerant computing by using interwoven pipeline stages that communicate with each other over an on-chip interconnect. Existing crossbar-switch based stage-level reconfigurable CMPs offer high reliability at the cost of significant area/power overheads. These overheads make large CMPs prohibitive to realize because of the area and power consumed by heavy interconnection networks. On the other hand, area/power-efficient architectures offer less reliability and inefficient stage-level resource utilization. In this paper, I propose a hierarchical multiplexing interconnection structure in lieu of a crossbar interconnect to design an area/power-efficient stage-level reconfigurable CMP. The proposed approach maintains the reliability offered by the crossbar switch while reducing the area and power overheads. Experimental results show that the proposed approach reduces area by up to 21% and power by up to 32% compared with the crossbar-switch based interconnection network.