• Title/Summary/Keyword: Fault-tolerance

Design and Implementation of a User-based MPI Checkpointer for Portability (이식성을 고려한 사용자기반 MPI 체크포인터의 설계 및 구현)

  • Ahn Sun-Il; Han Sang-Yong
    • Journal of KIISE:Computer Systems and Theory / v.33 no.1_2 / pp.35-43 / 2006
  • An MPI checkpointer is a tool that provides fault tolerance through checkpointing. Previous research on MPI checkpointers has focused on automatic checkpointing and recovery capabilities but has not considered portability. In this paper, we discuss the design and implementation issues we considered for portability when developing an MPI checkpointer called STFT. To increase portability, STFT first supports an abstraction interface for a single-process checkpointer. Second, STFT uses a user-based checkpointing method and limits the places where a user can take a checkpoint. Third, STFT has MPI_Init create network connections to the other MPI processes in a fixed order. With these features, we expect STFT to be easily adaptable to various platforms and MPI implementations, and we confirmed with a prototype implementation that STFT adapts easily to LAM and MPICH/P4.

An Approach to Software Analysis and Design based on Distributed Components (분산 컴포넌트 기반의 소프트웨어 분석 및 설계 방법)

  • Choi, You-Hee; Yeom, Keun-Hyuk
    • Journal of KIISE:Software and Applications / v.28 no.12 / pp.896-909 / 2001
  • Recently, more than 50 percent of software is being developed on distributed application platforms, and technologies such as EJB (Enterprise Java Beans)[1], COM (Component Object Model)[2], and CORBA (Common Object Request Broker Architecture)[3] have advanced distributed component-based software development. A systematic development process is therefore necessary for developing component-based applications on distributed application platforms. However, most component-based software development processes do not define concrete flows between tasks or the relationships among the artifacts of each task. Distribution issues are also not considered explicitly in most component-based software development. In this paper, we present an approach to analyzing and designing software based on distributed components. In this approach, we propose systematic guidelines for developing software based on the Unified Process, together with the relationships among the artifacts produced. We also explicitly consider distribution issues such as performance, fault tolerance, security, and distributed transactions in CORBA environments.

Determining Checkpoint Intervals of Non-Preemptive Rate Monotonic Scheduling Using Probabilistic Optimization (확률 최적화를 이용한 비선점형 Rate Monotonic 스케줄링의 체크포인트 구간 결정)

  • Kwak, Seong-Woo; Yang, Jung-Min
    • Journal of the Korean Institute of Intelligent Systems / v.21 no.1 / pp.120-127 / 2011
  • Checkpointing is a common method of realizing fault tolerance in real-time systems. This paper presents a scheme for determining checkpoint intervals using probabilistic optimization. The real-time system considered comprises multiple tasks in which transient faults occur according to a Poisson distribution. The tasks are scheduled by the non-preemptive Rate Monotonic (RM) algorithm. We formulate an optimization problem in which the probability of task completion is expressed in terms of the numbers of checkpoints. The solution to this problem is the optimal set of checkpoint numbers and intervals that maximizes this probability. The probability computation includes a schedulability test for the non-preemptive RM algorithm with respect to given numbers of checkpoint re-executions. A case study demonstrates the applicability of the proposed scheme.
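
The idea of expressing the completion probability in terms of the checkpoint count can be illustrated with a deliberately simplified single-task model. This is a sketch under stated assumptions, not the paper's formulation: it ignores the RM schedulability test, assumes equally spaced checkpoints, assumes a fault is detected at the end of a segment and only that segment is re-executed, and all function and parameter names (`completion_probability`, `best_checkpoint_count`, `overhead`) are hypothetical.

```python
import math


def completion_probability(exec_time, deadline, fault_rate, overhead, n):
    """Probability that a task split into n checkpointed segments finishes by its deadline.

    Assumed model (not from the paper): each segment lasts exec_time / n + overhead,
    faults arrive as a Poisson process with rate fault_rate, and a segment hit by a
    fault is re-executed from its preceding checkpoint.
    """
    segment = exec_time / n + overhead
    # Number of segment executions (first runs plus re-executions) that fit in the deadline.
    attempts = int(deadline // segment)
    if attempts < n:
        return 0.0
    # A segment is fault-free with probability exp(-rate * length) under Poisson arrivals.
    p = math.exp(-fault_rate * segment)
    # The task completes if at least n of the available attempts are fault-free.
    return sum(
        math.comb(attempts, i) * p**i * (1.0 - p) ** (attempts - i)
        for i in range(n, attempts + 1)
    )


def best_checkpoint_count(exec_time, deadline, fault_rate, overhead, max_n=50):
    """Pick the checkpoint count in 1..max_n that maximizes the completion probability."""
    return max(
        range(1, max_n + 1),
        key=lambda n: completion_probability(exec_time, deadline, fault_rate, overhead, n),
    )
```

In this toy model, more checkpoints shorten each segment (cheaper re-execution) but add overhead, so the maximizing count balances the two; the paper additionally couples this choice across tasks through the non-preemptive RM schedulability test.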

Low-Cost Causal Message Logging based Recovery Algorithm Considering Asynchronous Checkpointing (비동기적 검사점 기록을 고려한 저 비용 인과적 메시지 로깅 기반 회복 알고리즘)

  • Ahn, Jin-Ho; Bang, Seong-Jun
    • The KIPS Transactions:PartA / v.13A no.6 s.103 / pp.525-532 / 2006
  • Compared with previous recovery algorithms for causal message logging, Elnozahy's recovery algorithm considerably reduces the number of stable storage accesses and enables live processes to continue their computations while the recovery procedure is performed. However, if causal message logging is used with asynchronous checkpointing, the system state may be inconsistent after this algorithm has executed in the case of concurrent failures. In this paper, we show these inconsistent cases and propose a low-cost recovery algorithm for causal message logging that solves the problem. To ensure system consistency, the algorithm allows the recovery leader to obtain recovery information not only from the live processes but also from the other recovering processes. The proposed algorithm requires no extra messages compared with Elnozahy's, and the additional overhead incurred by message piggybacking is very low. Simulation results show that the proposed algorithm increases the recovery information collection time by only about 1.0% to 2.1% compared with Elnozahy's algorithm.

Design and Implementation of a Grid System META for Executing CFD Analysis Programs on Distributed Environment (분산 환경에서 CFD 분석 프로그램 수행을 위한 그리드 시스템 META 설계 및 구현)

  • Kang, Kyung-Woo; Woo, Gyun
    • The KIPS Transactions:PartA / v.13A no.6 s.103 / pp.533-540 / 2006
  • This paper describes the design and implementation of a grid system, META (Metacomputing Environment using Test-run of Application), which facilitates the execution of CFD (Computational Fluid Dynamics) analysis programs in a distributed environment. META allows CFD program developers to access computing resources distributed over the network as if they were a single computer system. The research issues involved in grid computing include fault tolerance, computing resource selection, and user-interface design. In this paper, we present an automatic resource selection scheme for executing parallel SPMD (Single Program Multiple Data) applications written in MPI (Message Passing Interface). The proposed scheme is based on the network latency and the elapsed time of the kernel loop obtained from a test run. Network latency strongly influences execution performance when a parallel program is distributed and executed over several systems. The elapsed time of the kernel loop can be used as an estimator of the whole execution time of a CFD program because of a characteristic common to CFD programs: the kernel loop consumes over 90% of the whole execution time.

Large Scale Failure Adaptive Routing Protocol for Wireless Sensor Networks (무선 센서 네트워크를 위한 대규모 장애 적응적 라우팅 프로토콜)

  • Lee, Joa-Hyoung; Seon, Ju-Ho; Jung, In-Bum
    • The KIPS Transactions:PartA / v.16A no.1 / pp.17-26 / 2009
  • Large-scale wireless sensor networks are expected to play an increasingly important role in data collection in hazardous areas. However, the physical fragility of sensor nodes makes reliable routing in such areas a challenging problem. Since several sensor nodes in a hazardous area can be damaged all at once, the network should be able to recover routing from node failures over a large area. Many routing protocols take the failure recovery of a single node into account, but it is very hard for these protocols to recover routing from large-scale failures. In this paper, we propose a routing protocol, which we refer to as LSFA, that quickly recovers the network from failures over a large area. LSFA detects a failure by counting packet losses from the parent node and, when a failure is detected, decreases the routing interval to notify neighbor nodes of the failure. Our experimental results indicate that LSFA recovers from large-area failures quickly and with fewer packets than previous protocols.
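
The detection step described in the abstract (count packet losses from the parent, then shorten the routing interval) might look roughly like the following. This is only an illustrative sketch: the class name `ParentMonitor`, the loss threshold, the two interval values, and the choice to reset the counter when a packet arrives are all assumptions, since the abstract does not specify them.

```python
class ParentMonitor:
    """Tracks consecutive packet losses from a node's parent (hypothetical helper)."""

    def __init__(self, loss_threshold=3, normal_interval=10.0, fast_interval=1.0):
        # All three parameters are assumed values for illustration only.
        self.loss_threshold = loss_threshold
        self.normal_interval = normal_interval
        self.fast_interval = fast_interval
        self.missed = 0
        self.routing_interval = normal_interval

    def on_expected_packet(self, received):
        """Record whether an expected packet arrived; return True if failure is suspected."""
        if received:
            # Assumption: any packet from the parent clears the suspicion.
            self.missed = 0
            self.routing_interval = self.normal_interval
        else:
            self.missed += 1
            if self.missed >= self.loss_threshold:
                # Send routing updates more often so neighbors learn of the failure sooner.
                self.routing_interval = self.fast_interval
        return self.missed >= self.loss_threshold
```

A node would call `on_expected_packet` once per expected parent transmission and use `routing_interval` to schedule its next routing message.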

Pub/Sub-based Sensor virtualization framework for Cloud environment

  • Ullah, Mohammad Hasmat; Park, Sung-Soon; Nob, Jaechun; Kim, Gyeong Hun
    • International journal of advanced smart convergence / v.4 no.2 / pp.109-119 / 2015
  • The interaction between wireless sensors, such as those in the Internet of Things (IoT), and the Cloud is a new paradigm of communication virtualization for overcoming resource and efficiency restrictions. Cloud computing provides unlimited platforms, resources, and services and covers almost every area of computing. Wireless Sensor Networks (WSNs), on the other hand, have gained attention for their potential support of attractive applications such as IoT, environment monitoring, healthcare, military, critical infrastructure monitoring, home and industrial automation, transportation, and business. In addition, virtual groups and social networks play a main role in information sharing. However, sensor networks lack resources, storage capacity, and computational power, as well as extensibility, fault tolerance, reliability, and openness. Their data are not yet available to community groups or cloud environments for general-purpose research or use. Reducing the gap between the real and virtual worlds by adding WSN-driven data to cloud environments and virtual communities could attract considerable attention and bring benefits in various sectors. We propose a Pub/Sub-based sensor virtualization framework for Cloud environments. This integration provides resources, services, and storage together with sensor-driven data to the community. We virtualize physical sensors as virtual sensors on cloud computing, and the middleware and virtual sensors are provisioned automatically to end users whenever they are required. Our architecture provides service to end users without requiring them to be concerned with implementation details. Furthermore, we propose an efficient content-based event matching algorithm that analyzes subscriptions and publishes the proper content in a cost-effective manner. Our evaluation shows that the algorithm performs better than previously proposed algorithms.

Malfunction Measures and Susceptibility test of Elevator Based on EMS(Electromagnetic Susceptibility) Standard (EMS 규정에 따른 승강기 내성시험 및 오동작 대책에 관한 연구)

  • Kim, Gi-Hyun; Bae, Suk-Myong; Lee, Joo-Hwan
    • Journal of the Korean Institute of Illuminating and Electrical Installation Engineers / v.21 no.2 / pp.78-85 / 2007
  • Elevator malfunctions such as passengers being trapped inside, sudden rises, sudden stops, and level-indication errors, which can alarm passengers and lead to life-threatening accidents, continue to occur. However, because these malfunctions cannot be reproduced in the field and are transient, appearing and then disappearing, it is difficult in practice to confirm their cause. We therefore performed susceptibility tests on three recently built elevator models according to the EN12016(2004) standard and analyzed the elevators' operating characteristics to study the causes of entrapment and malfunction. After the countermeasures for malfunction are supplemented, this paper can serve as a reference for proposing countermeasures for elevator malfunctions and for analyzing the relationship between power quality and elevator malfunctions and faults.

Design and Implementation of iATA-based RAID5 Distributed Storage Servers (iATA 기반의 RAID5 분산 스토리지 서버의 설계 및 구현)

  • Ong, Ivy; Lim, Hyo-Taek
    • Journal of the Korea Institute of Information and Communication Engineering / v.14 no.2 / pp.305-311 / 2010
  • iATA (Internet Advanced Technology Attachment) is a block-level protocol developed to transfer ATA commands over TCP/IP networks as an alternative network storage solution that addresses the problem of insufficient storage in mobile devices. This paper applies the concept of RAID5 distributed storage servers to iATA. The idea is to combine several machines with relatively inexpensive disk drives into a server array that works as a single virtual storage device, thus increasing the reliability and speed of operations. If one machine fails, the server array does not fail immediately but continues to function in degraded mode. Meanwhile, the lost information can be recovered by applying the Boolean exclusive OR (XOR) function to the bits on the remaining machines. I/O measurements with a benchmark tool indicate that the additional fault-tolerance feature does not delay read/write operations for file sizes ranging from 4 KB to 2 MB, while achieving higher data integrity.
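
The XOR-based recovery mentioned in the abstract can be sketched as follows. This is a minimal illustration of the general parity technique for a single stripe, not the paper's iATA implementation: the function names are hypothetical, and real RAID5 additionally rotates the parity block across machines from stripe to stripe, which the sketch omits by always placing parity last.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized byte blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Build one stripe: the data blocks followed by their parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover_block(stripe, failed_index):
    """Rebuild the block of the failed machine by XOR-ing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)
```

Because XOR is its own inverse, the same `recover_block` call rebuilds either a lost data block or the lost parity block, which is why the array tolerates the failure of any single machine.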

A System Design for Real-Time Monitoring of Patient Waiting Time based on Open-Source Platform (오픈소스 플랫폼 기반의 실시간 환자 대기시간 모니터링 시스템 설계)

  • Ryu, Wooseok
    • Journal of the Korea Institute of Information and Communication Engineering / v.22 no.4 / pp.575-580 / 2018
  • This paper discusses a system for real-time monitoring of patient waiting time in hospitals based on an open-source platform. Open-source projects make it possible to develop, at lower cost, a high-performance stream processing system that analyzes and processes stream data in real time. The Hadoop ecosystem is a well-known big data processing platform consisting of numerous open-source subprojects. This paper first defines several requirements for the monitoring system and selects a few projects from the Hadoop ecosystem that are suited to meeting them. The paper then proposes a system architecture and a detailed module design using Apache Spark, Apache Kafka, and other projects. The proposed system can reduce development costs by using open-source projects and by acquiring data from the legacy hospital information system. High performance and fault tolerance can also be achieved through distributed processing.