• Title/Summary/Keyword: checkpointing

Replicated Checkpointing Failure Recovery Schemes for Mobile Hosts and Mobile Support Station in Cellular Networks (셀룰라 네트워크 환경에서의 이중화 체크포인팅을 이용한 이동 호스트 및 기지국 결함 복구 기법)

  • Byun, Kyue-Sub;Kim, Jai-Hoon
    • The Journal of Korean Institute of Communications and Information Sciences / v.27 no.1B / pp.13-23 / 2002
  • A mobile host is prone to failure because of its lack of stable storage, the low bandwidth of the wireless channel, high mobility, and limited battery life. Many researchers have studied ways to overcome these problems. To achieve high availability in cellular networks, it is necessary to consider recovery from failures of mobile support stations as well as of mobile hosts. In this paper, we present a modified trickle scheme, based on checkpointing, for recovering from failures of a mobile support station, and we analyze and compare its performance. We propose and analyze two schemes: one waits for the recovery of the mobile support station that holds the last checkpoint, and the other searches for a new path to another mobile support station that holds the checkpoint.

An Error Detection and Recovery System based on Multimedia Computer Supported Cooperative Work (멀티미디어 협동 작업환경에서의 오류 감지 및 복구 시스템)

  • Ko, Eung-Nam;Hwang, Dae-Joon
    • The Transactions of the Korea Information Processing Society / v.7 no.5 / pp.1330-1340 / 2000
  • Multimedia is now applied in many real-world areas. In particular, interest in multimedia systems and CSCW (Computer Supported Cooperative Work) has increased. Despite this trend, fault tolerance for CSCW has not yet been studied thoroughly. We propose EDR_MCSCW, a system that uses software techniques to detect and recover from software errors in multimedia computer supported cooperative work environments such as DOORAE. DOORAE is a framework that supports the development of multimedia applications for computer-based collaborative work. When an error occurs, EDR_MCSCW detects it by hooking MS-Windows API (Application Program Interface) functions. Once an error is found, a stack-based checkpointing and recovery algorithm that eliminates the domino effect is used to recover the multimedia and CSCW application.

A Fault-tolerant Scheme for Clustering Routing Protocols (클러스터 기반 라우팅 프로토콜을 위한 결함허용기법)

  • Min, Hong;Kim, Bong-Jae;Jung, Jin-Man;Kim, Seuk-Hyun;Yoon, Jin-Hyuk;Cho, Yoo-Kun;Heo, Jun-Young;Yi, Sang-Ho;Hong, Ji-Man
    • Journal of KIISE:Computing Practices and Letters / v.16 no.6 / pp.668-672 / 2010
  • In wireless sensor networks, a fault-tolerant scheme that detects sensor node failures and improves the reliability of the collected information must be considered. Resource-constrained sensor nodes are vulnerable to failure and cannot use existing checkpointing schemes, which do not take the characteristics of sensor networks into account. In this paper, we propose a fault-tolerant scheme for clustering routing protocols that supports the recovery of a head node.

DJFS: Providing Highly Reliable and High-Performance File System with Small-Sized NVRAM

  • Kim, Junghoon;Lee, Minho;Song, Yongju;Eom, Young Ik
    • ETRI Journal / v.39 no.6 / pp.820-831 / 2017
  • File systems and applications implement their own update protocols to guarantee data consistency, which is one of the most crucial aspects of computing systems. However, we found that storage devices are substantially under-utilized when preserving data consistency because these protocols generate massive storage write traffic with many disk cache flush operations and force-unit-access (FUA) commands. In this paper, we present DJFS (Delta-Journaling File System), which provides both high performance and data consistency for different applications. We made three technical contributions to achieve our goal. First, to remove all storage accesses with disk cache flush operations and FUA commands, DJFS uses small-sized NVRAM for the file system journal. Second, to reduce the access latency and space requirements of NVRAM, DJFS attempts to journal compressed differences in the modified blocks. Finally, to relieve explicit checkpointing overhead, DJFS aggressively reflects checkpoint transactions to the file system area in units of a specified region. Our evaluation on the TPC-C SQLite benchmark shows that, using our novel optimization schemes, DJFS outperforms Ext4 by up to 64.2 times with only 128 MB of NVRAM.
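
The idea of journaling compressed differences rather than whole blocks can be illustrated with a minimal sketch. The abstract does not say how DJFS computes or compresses the difference, so the XOR delta, the use of `zlib`, the 4 KiB block size, and the function names `make_delta` and `apply_delta` below are all assumptions chosen for illustration.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, not taken from the paper


def make_delta(old: bytes, new: bytes) -> bytes:
    """Return a compressed XOR difference between two versions of a block."""
    if len(old) != len(new):
        raise ValueError("both block versions must have the same length")
    # Unchanged bytes XOR to zero, so a small update yields a highly
    # compressible buffer.
    xor = bytes(a ^ b for a, b in zip(old, new))
    return zlib.compress(xor)


def apply_delta(old: bytes, delta: bytes) -> bytes:
    """Rebuild the new block version from the old version and a journaled delta."""
    xor = zlib.decompress(delta)
    return bytes(a ^ b for a, b in zip(old, xor))


# Usage: a small in-place update to an otherwise unchanged block.
old_block = bytes(BLOCK_SIZE)
new_block = bytearray(old_block)
new_block[100:108] = b"checkpnt"
new_block = bytes(new_block)

delta = make_delta(old_block, new_block)
restored = apply_delta(old_block, delta)
```

In this sketch the journal entry is much smaller than the block it describes when only a few bytes change, which is the property the abstract relies on to fit the journal in a small NVRAM.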

Efficient Checkpoint Algorithm for Message-Passing Parallel Applications on Cloud Computing (클라우드컴퓨팅에서 메시지패싱방식 응용프로그램의 효율적인 체크포인트 알고리즘)

  • Le, Duc Tai;Dao, Manh Thuong Quan;Ahn, Min-Joon;Choo, Hyun-Seung
    • Proceedings of the Korea Information Processing Society Conference / 2011.04a / pp.156-157 / 2011
  • In this work, we study the checkpoint/restart problem for message-passing parallel applications running in a cloud computing environment. This is a new direction arising from the trend of enabling applications to run in the cloud. The main objective is to propose an efficient checkpoint algorithm for message-passing parallel applications that takes communication with external systems into account. We implement the algorithm by modifying the open-source libraries gSOAP and OpenMPI, which support service calls and checkpointing of message-passing parallel programs, respectively. The simulation showed that the additional cost the algorithm adds to executing and checkpointing the application is negligible. Ultimately, the algorithm efficiently supports the checkpoint/restart service for message-passing parallel applications that send requests to external services.

Enhancing Dependability of Systems by Exploiting Storage Class Memory (스토리지 클래스 메모리를 활용한 시스템의 신뢰성 향상)

  • Kim, Hyo-Jeen;Noh, Sam-H.
    • Journal of KIISE:Computer Systems and Theory / v.37 no.1 / pp.19-26 / 2010
  • In this paper, we adopt Storage Class Memory (SCM), a next-generation non-volatile RAM technology, as part of main memory alongside DRAM, and examine the SCM+DRAM main memory system from the dependability perspective. Our system provides instant system on/off without bootstrapping, dynamic selection of process persistence or non-persistence, and fast recovery from power and/or software failure. The advantage of our system is that it avoids the problems of checkpointing, namely heavy overhead and recovery delay. Furthermore, because the system is fully transparent to applications, it is easily applicable to real-world environments. As a proof of concept, we implemented the system on a commodity Linux 2.6.21 kernel. We verify that persistence-enabled processes resume execution instantly across a system off-on cycle without any loss of state or data. Therefore, we conclude that our system can improve availability and reliability.

Determining Checkpoint Intervals of Non-Preemptive Rate Monotonic Scheduling Using Probabilistic Optimization (확률 최적화를 이용한 비선점형 Rate Monotonic 스케줄링의 체크포인트 구간 결정)

  • Kwak, Seong-Woo;Yang, Jung-Min
    • Journal of the Korean Institute of Intelligent Systems / v.21 no.1 / pp.120-127 / 2011
  • Checkpointing is one of the common methods of realizing fault tolerance in real-time systems. This paper presents a scheme for determining checkpoint intervals using probabilistic optimization. The real-time system considered comprises multiple tasks in which transient faults occur according to a Poisson distribution. The tasks are scheduled by the non-preemptive Rate Monotonic (RM) algorithm. We formulate an optimization problem in which the probability of task completion is expressed in terms of the numbers of checkpoints. The solution to this problem is the optimal set of checkpoint numbers and intervals that maximizes this probability. The probability computation includes a schedulability test for the non-preemptive RM algorithm with respect to a given number of checkpoint re-executions. A case study is given to show the applicability of the proposed scheme.
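
How the number of checkpoints affects the probability of completing a task can be sketched for a single task in isolation. This is a simplified model and not the paper's formulation: it ignores RM scheduling of multiple tasks, assumes equally spaced checkpoints, assumes a fault is detected at the end of a segment and forces that whole segment to be re-executed, and uses hypothetical names (`completion_probability`, `best_checkpoint_count`) and parameters.

```python
import math


def completion_probability(exec_time, deadline, n_checkpoints, overhead, fault_rate):
    """P(task finishes by the deadline) with equally spaced checkpoints.

    Assumed model: faults arrive as a Poisson process with rate `fault_rate`;
    each segment ends by saving a checkpoint that costs `overhead`.
    """
    segments = n_checkpoints + 1
    seg_len = exec_time / segments + overhead
    fault_free_time = segments * seg_len
    if fault_free_time > deadline:
        return 0.0  # not schedulable even without faults
    # Number of segment re-executions that the slack can absorb.
    retries = int((deadline - fault_free_time) // seg_len)
    # Probability that no fault hits one segment execution.
    p_ok = math.exp(-fault_rate * seg_len)
    # Negative binomial sum: all segments succeed with at most `retries` failed runs.
    return sum(
        math.comb(segments + j - 1, j) * p_ok**segments * (1 - p_ok) ** j
        for j in range(retries + 1)
    )


def best_checkpoint_count(exec_time, deadline, overhead, fault_rate, max_n=50):
    """Pick the checkpoint count in 0..max_n that maximizes the completion probability."""
    return max(
        range(max_n + 1),
        key=lambda n: completion_probability(exec_time, deadline, n, overhead, fault_rate),
    )


# Usage with made-up numbers: 10 time units of work, deadline 15,
# checkpoint cost 0.2, fault rate 0.05 per time unit.
n_best = best_checkpoint_count(10, 15, 0.2, 0.05)
p_best = completion_probability(10, 15, n_best, 0.2, 0.05)
```

The trade-off is visible in the model: more checkpoints shorten each segment, so a fault wastes less work, but every checkpoint adds overhead that eats into the slack available for re-execution.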

Compound Backup Technique using Hot-Cold Data Classification in the Distributed Memory System (분산메모리시스템에서의 핫콜드 데이터 분류를 이용한 복합 백업 기법)

  • Kim, Woo Chur;Min, Dong Hee;Hong, Ji Man
    • Smart Media Journal / v.4 no.3 / pp.16-23 / 2015
  • As information technology advances, data processing systems are required to handle large amounts of data. However, existing on-disk systems are limited in their ability to process rapidly growing data. For this reason, in-memory systems, which store and manage data in fast memory rather than on hard disk, are being used. Although an in-memory system processes data quickly, it needs fault tolerance techniques because the volatility of memory carries a risk of data loss, and these techniques degrade its performance. In this paper, we classify data into hot and cold data according to the data usage characteristics of the in-memory system and propose a compound backup technique to ensure data persistence. The proposed technique increases persistence and reduces the performance degradation.

An Implementation of Fault Tolerant Software Distributed Shared Memory with Remote Logging (원격 로깅 기법을 이용하는 고장 허용 소프트웨어 분산공유메모리 시스템의 구현)

  • 박소연;김영재;맹승렬
    • Journal of KIISE:Computer Systems and Theory / v.31 no.5_6 / pp.328-334 / 2004
  • Recently, software DSMs have continued to improve in performance and scalability. As software DSMs become attractive on larger clusters, attention is likely to shift toward improving system reliability. A popular approach to tolerating failures is message logging with checkpointing, and many log-based rollback recovery schemes have been proposed. In this work, we propose a remote logging scheme that uses the volatile memory of a remote node assigned to each node. Because our remote logging does not incur frequent disk accesses during failure-free execution, its logging overhead is not significant, especially over a high-speed communication network. The remote logging tolerates multiple failures as long as the backup nodes of the failed nodes are alive, which makes DSMs considerably more reliable. We designed and implemented FT-KDSM (Fault Tolerant KAIST DSM) with remote logging and measured its logging overhead and recovery time.
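
The recovery condition stated in the abstract (a failed node can recover as long as its backup node is alive) can be illustrated with a small in-process simulation. This is only a sketch under assumptions: the `Node` class, the ring assignment of backups in `make_ring`, and the integer "state" are hypothetical stand-ins and do not reflect how FT-KDSM is implemented.

```python
class Node:
    """A node that logs its updates in the volatile memory of a backup node."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.alive = True
        self.state = 0        # application state (a running sum in this sketch)
        self.checkpoint = 0   # last checkpoint, assumed to survive a crash
        self.backup = None    # node that holds this node's log
        self.held_logs = {}   # logs kept in volatile memory for other nodes

    def apply(self, value):
        # Log to the remote node's memory first, then update local state.
        self.backup.held_logs[self.node_id].append(value)
        self.state += value

    def take_checkpoint(self):
        self.checkpoint = self.state
        # Entries older than the checkpoint are no longer needed for recovery.
        self.backup.held_logs[self.node_id] = []

    def crash(self):
        self.alive = False
        self.state = None
        self.held_logs = {}   # volatile memory, including held logs, is lost

    def recover(self):
        backup = self.backup
        if not backup.alive or self.node_id not in backup.held_logs:
            raise RuntimeError("remote log is unavailable; cannot recover")
        # Roll back to the checkpoint, then replay the remotely held log.
        self.state = self.checkpoint
        for value in backup.held_logs[self.node_id]:
            self.state += value
        self.alive = True


def make_ring(n):
    """Create n nodes where node i logs to node (i + 1) % n (an assumed layout)."""
    nodes = [Node(i) for i in range(n)]
    for i, node in enumerate(nodes):
        node.backup = nodes[(i + 1) % n]
        node.backup.held_logs[node.node_id] = []
    return nodes


# Usage: node 0 checkpoints, applies one more update, crashes, and recovers.
nodes = make_ring(3)
first = nodes[0]
first.apply(5)
first.take_checkpoint()
first.apply(7)
first.crash()
first.recover()
```

No disk access happens on the logging path in this sketch, which mirrors the abstract's argument for low failure-free overhead; the cost is that a node and its backup failing together leaves the log unrecoverable.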

Garbage Collection Protocol of Fault Tolerance Information in Multi-agent Environments (멀티에이전트 환경에서 결함 포용 정보의 쓰레기 처리 기법)

  • 이대원;정광식;이화민;신상철;이영준;유헌창;이원규
    • Journal of KIISE:Computer Systems and Theory / v.31 no.3_4 / pp.204-212 / 2004
  • Distributed systems have a higher probability of failure than stand-alone systems, so many fault tolerance techniques have been developed. As the amount of stored fault tolerance information grows, storage becomes insufficient and system performance degrades. To avoid this degradation, useless fault tolerance information must be deleted. In this paper, we propose a garbage collection algorithm for fault tolerance information. We define and design a garbage collection agent that collects fault tolerance information, an information agent that manages fault tolerance data, and a facilitator agent that handles communication between agents. For rollback recovery, we use an independent checkpointing protocol and a sender-based pessimistic message logging protocol. In the proposed algorithm, the garbage collection, information, and facilitator agents are created together with the process, and the information agent constructs domain knowledge from its checkpoints and non-deterministic events. The garbage collection agent decides when to collect garbage and deletes useless fault tolerance information in cooperation with the information and facilitator agents. To validate the proposed agent-based garbage collection technique, we compare the domain knowledge of a system that performs garbage collection after rollback recovery with that of a system that does not.