Notes on checkpointing/recovery in message passing systems
Terminology
- Process:-
- Message:-
- (Non)Volatile Storage:-
 
Checkpointing is used to make systems fault tolerant. In this method, each process periodically saves (checkpoints) its state to non-volatile storage. When a failure occurs, the checkpointed state is used to roll the processes back to an earlier consistent state.
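The save-then-recover cycle described above can be sketched for a single process as follows. This is only an illustration under stated assumptions: the function names (save_checkpoint, load_checkpoint, run), the JSON file format, and the simulated crash are all hypothetical and not part of these notes. It also does not address consistency across multiple processes exchanging messages.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write the state to disk atomically (hypothetical helper).

    The data goes to a temporary file first and is then renamed over the
    old checkpoint, so a crash mid-write never leaves a partial file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to non-volatile storage
    os.replace(tmp_path, path)  # atomic rename


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, crash_at=None):
    """Toy process: sums 0..total_steps-1, checkpointing periodically.

    `crash_at` is an assumption used only to simulate a failure.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

If the process fails at step 7 with a checkpoint every 3 steps, a restart resumes from the checkpoint taken at step 6, so only the work done after that checkpoint is repeated.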
Things to optimize:-
- (Time spent in checkpointing) / (Time spent in processing).
- Size of the checkpointed state.
- Minimize the progress loss during recovery.
- Minimize recovery time.
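The first and third goals pull against each other, which a small calculation can make concrete. The sketch below assumes a simplified model that is not stated in these notes: checkpoints are taken at a fixed interval, each checkpoint has a fixed cost, and on failure all work since the last checkpoint is lost. The function name checkpoint_tradeoff is hypothetical.

```python
def checkpoint_tradeoff(interval, checkpoint_cost):
    """Return (overhead_ratio, worst_case_loss) for a fixed checkpoint interval.

    Assumed model (not from the notes):
      interval         processing time between two checkpoints
      checkpoint_cost  time spent taking one checkpoint
    overhead_ratio is (time spent in checkpointing) / (time spent in processing).
    worst_case_loss is the processing time lost if the failure happens
    just before the next checkpoint would have been taken.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    overhead_ratio = checkpoint_cost / interval
    worst_case_loss = interval
    return overhead_ratio, worst_case_loss


def compare_intervals(intervals, checkpoint_cost):
    """Tabulate the trade-off for several candidate intervals."""
    return [(t, *checkpoint_tradeoff(t, checkpoint_cost)) for t in intervals]
```

Under this model, doubling the interval halves the overhead ratio but doubles the worst-case progress loss, so the interval has to be chosen as a compromise between the two.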
 
Checkpointing requires