Save up to 90% using EC2 Spot, even for long-running HPC jobs
HPC Blog
This article explains how to save up to 90% on compute costs for long-running HPC jobs by running them on Amazon EC2 Spot Instances, even when the applications cannot handle Spot interruptions natively. The technique relies on checkpoint/restore tooling that captures the state of a running job so that it can resume on a new instance after an interruption.
Specifically, the article covers:
- The checkpoint/restore process and how it handles Spot interruptions
- Testing methodology to validate the approach
- Operational considerations such as compute requirements, scheduler integration, storage (performance, permissions, and capacity), license management, incremental checkpoints, and handling Spot interruptions
- Constraints specific to checkpointing tools
- Considerations for workloads spanning multiple nodes
- The overall cost-saving potential and conclusion
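The summary above does not show how an interruption triggers a checkpoint, so the following is only a minimal sketch of one possible approach, not the article's implementation. It assumes the job runs on an EC2 Spot Instance and polls the instance metadata service (IMDSv2) path `spot/instance-action`, which returns 404 until an interruption is scheduled. The `watch` helper and the `checkpoint` callback are hypothetical names; the callback stands in for whatever checkpoint tool is actually used.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"  # EC2 instance metadata service


def fetch_instance_action(timeout=2.0):
    """Return the raw Spot instance-action JSON, or None if none is scheduled."""
    # IMDSv2 requires a short-lived session token obtained with a PUT request.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice has been issued yet
            return None
        raise


def interruption_pending(raw):
    """True if the metadata payload announces a terminate, stop, or hibernate."""
    if raw is None:
        return False
    return json.loads(raw).get("action") in {"terminate", "stop", "hibernate"}


def watch(fetch, checkpoint, poll_seconds=5.0, sleep=time.sleep, max_polls=None):
    """Poll `fetch`; call `checkpoint` (hypothetical hook) once a notice appears.

    Returns True if a checkpoint was triggered, False if `max_polls` ran out.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if interruption_pending(fetch()):
            checkpoint()
            return True
        polls += 1
        sleep(poll_seconds)
    return False
```

On an instance, this might be started alongside the job as `watch(fetch_instance_action, my_checkpoint)`, where `my_checkpoint` is a placeholder for the site-specific checkpoint command. Because a Spot interruption notice arrives roughly two minutes before the instance is reclaimed, the checkpoint has to complete and reach shared storage within that window, which is one reason storage performance and incremental checkpoints appear among the considerations listed above.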