Notes on checkpointing/recovery in message passing systems
Terminology
- Process:-
- Message:-
- (Non)Volatile Storage:-
Checkpointing is used to make systems fault tolerant. Each process periodically saves (checkpoints) its state to non-volatile storage. When a failure occurs, the checkpointed state is used to roll the processes back to an earlier consistent state.
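A minimal sketch of the idea for a single process may help make it concrete. It assumes a local file stands in for non-volatile storage and that the state fits in a small JSON object. The names `save_checkpoint`, `load_checkpoint`, `run` and the `crash_at` parameter are hypothetical, chosen only for illustration. It does not address how several communicating processes agree on a consistent state.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so a crash never leaves a partial checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to the storage device
    os.replace(tmp_path, path)  # atomically replace the previous checkpoint


def load_checkpoint(path, default):
    """Return the last checkpointed state, or default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, crash_at=None):
    """Sum 0..total_steps-1, checkpointing every checkpoint_every steps.

    crash_at is a hypothetical hook that simulates a failure at that step.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})  # recover if possible
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same path after a failure resumes from the last saved step instead of from step 0. Only the work done since that checkpoint is repeated.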
Things to optimize:-
- (Time spent in checkpointing) / (Time spent in processing).
- Size of the checkpointed state.
- Progress lost during recovery.
- Recovery time.
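These goals pull against each other, which a rough calculation can illustrate. The sketch below assumes a simplified model that is not stated in these notes: checkpoints are taken at a fixed interval, each costs a fixed amount of time, and a failure loses at most the work done since the last checkpoint. The function names are hypothetical.

```python
def checkpoint_overhead(checkpoint_cost, interval):
    """(Time spent in checkpointing) / (time spent in processing) for one cycle."""
    return checkpoint_cost / interval


def worst_case_progress_loss(interval):
    """A failure just before the next checkpoint loses one full interval of work."""
    return interval


# Assumed figures: a 2-second checkpoint taken at three different intervals.
for interval in (10.0, 60.0, 600.0):
    overhead = checkpoint_overhead(2.0, interval)
    loss = worst_case_progress_loss(interval)
    print(f"interval={interval}s overhead={overhead:.4f} worst-case loss={loss}s")
```

Under this model, checkpointing more often lowers the progress lost on failure but raises the overhead ratio, so the interval has to be chosen as a compromise.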
Checkpointing requires