Scaling seismic foundation models on AWS: Distributed training with Amazon SageMaker HyperPod and expanding context windows
Machine Learning Blog
This article describes how TGS, a geoscience data provider, partnered with AWS to optimize seismic foundation model training on Amazon SageMaker HyperPod, cutting training time from months to days and enabling the model to analyze larger seismic volumes.
- Reduced training time from 6 months to 5 days using distributed training across 16 Amazon EC2 P5 instances
- Achieved near-linear scaling (90-95% parallel efficiency) across 128 GPUs with DeepSpeed ZeRO-2
- Streamed training data directly from Amazon S3 instead of Amazon FSx, reducing storage costs by more than 90%
- Implemented ring attention and context parallelism to expand the model's context window 4.5x
- Enabled processing of larger 3D seismic volumes for better geological pattern detection
- Key optimization: DeepSpeed ZeRO-2 outperformed FSDP2 and ZeRO-3 for this workload
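The 90-95% parallel efficiency figure compares measured throughput at scale with perfect linear scaling from a smaller baseline (a single GPU or a single node; which baseline TGS used is not stated here). The sketch below assumes throughput is measured in samples per second, and its numbers are made up for illustration rather than taken from TGS's runs.

```python
def parallel_efficiency(base_throughput: float, base_units: int,
                        scaled_throughput: float, scaled_units: int) -> float:
    """Return measured scaled throughput divided by ideal linear-scaling throughput."""
    if base_throughput <= 0 or base_units <= 0 or scaled_units <= 0:
        raise ValueError("throughputs and unit counts must be positive")
    # Perfect scaling multiplies baseline throughput by the resource ratio.
    ideal = base_throughput * (scaled_units / base_units)
    return scaled_throughput / ideal


# Hypothetical numbers: 1 node (8 GPUs) at 100 samples/s vs. 16 nodes (128 GPUs).
eff = parallel_efficiency(100.0, 8, 1480.0, 128)
print(f"{eff:.3f}")  # 1480 / 1600 -> 0.925
```

Put differently, at 90-95% efficiency a 128-GPU job delivers roughly the throughput of 115 to 122 perfectly scaled GPUs.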
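The ZeRO-2 result is easier to reason about with a concrete DeepSpeed configuration. ZeRO stage 2 partitions optimizer states and gradients across data-parallel ranks but keeps a full copy of the parameters on every GPU. Stage 3 also partitions the parameters, so it must gather them during the forward and backward passes. When the model fits in GPU memory, avoiding that extra communication is one plausible reason stage 2 would come out ahead. The sketch below builds such a config as a Python dict. The keys are standard DeepSpeed config fields, but the batch sizes, bucket sizes, and clipping value are assumptions, not the settings TGS used. A dict like this would typically be passed as the `config` argument of `deepspeed.initialize`.

```python
import json


def zero2_config(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> dict:
    """Build a DeepSpeed config dict enabling ZeRO stage 2 (illustrative values only)."""
    return {
        # DeepSpeed expects train_batch_size == micro batch * accumulation steps * data-parallel size.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "gradient_clipping": 1.0,
        "zero_optimization": {
            "stage": 2,                    # shard optimizer states and gradients, not parameters
            "overlap_comm": True,          # overlap gradient reduction with backward compute
            "contiguous_gradients": True,  # copy gradients into a contiguous buffer to limit fragmentation
            "reduce_bucket_size": 500_000_000,
            "allgather_bucket_size": 500_000_000,
        },
    }


# 16 P5 instances x 8 GPUs = 128 data-parallel ranks; micro batch and accumulation are hypothetical.
cfg = zero2_config(micro_batch_per_gpu=4, grad_accum_steps=2, world_size=16 * 8)
print(json.dumps(cfg, indent=2))
```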
The solution demonstrates how optimized data pipelines, distributed training frameworks, and advanced parallelization techniques enable efficient scaling of foundation models for specialized scientific domains.
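Streaming straight from Amazon S3 removes the shared file system, so each data-parallel rank, and each data-loader worker within a rank, typically needs its own disjoint slice of the object list. The helper below is a hypothetical sketch of round-robin sharding over S3 object keys. The key names are made up, and the actual download step (for example, one boto3 `get_object` call per key) is deliberately left out.

```python
from typing import List, Sequence


def shard_keys(keys: Sequence[str], rank: int, world_size: int,
               worker_id: int = 0, num_workers: int = 1) -> List[str]:
    """Return the object keys one loader worker on one rank should stream.

    Keys are split round-robin across ranks, then across workers on that rank,
    so each key is read exactly once per epoch across the whole job.
    """
    if not 0 <= rank < world_size or not 0 <= worker_id < num_workers:
        raise ValueError("rank or worker_id out of range")
    per_rank = list(keys)[rank::world_size]
    return per_rank[worker_id::num_workers]


# Hypothetical key layout: 10 shards, 4 ranks, 2 loader workers per rank.
keys = [f"seismic/train/shard-{i:05d}.bin" for i in range(10)]
print(shard_keys(keys, rank=1, world_size=4, worker_id=0, num_workers=2))
# ['seismic/train/shard-00001.bin', 'seismic/train/shard-00009.bin']
```

Round-robin assignment keeps per-worker shard counts within one of each other; a real pipeline would usually also shuffle the key list with a shared per-epoch seed before sharding.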
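Ring attention is a form of context parallelism: the sequence is split across devices, and key/value blocks rotate around a ring. Each device therefore scores only its own n/P queries against one n/P block at a time instead of materializing the full n x n score matrix, which is what lets longer contexts fit. The pure-Python simulation below is a sketch for intuition only. It assumes a single attention head, no causal mask, and a sequence length divisible by the device count. It is not TGS's implementation; a real system runs each device's loop on its own GPU and exchanges K/V blocks with point-to-point communication.

```python
import math
from typing import List

Matrix = List[List[float]]


def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def full_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Reference softmax(Q K^T / sqrt(d)) V over the whole sequence (no causal mask)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        s = [_dot(qi, kj) * scale for kj in k]
        m = max(s)
        w = [math.exp(x - m) for x in s]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z for c in range(len(v[0]))])
    return out


def ring_attention(q: Matrix, k: Matrix, v: Matrix, num_devices: int) -> Matrix:
    """Simulate ring attention with one contiguous sequence block per 'device'.

    At ring step t, device p scores its local queries against the K/V block that
    started on device (p - t) mod P, then passes that block to device (p + 1) mod P.
    An online softmax merges the partial results, so in a real deployment no device
    needs the full n x n score matrix or the full K/V sequence at once.
    """
    n, d, dv = len(q), len(q[0]), len(v[0])
    if num_devices <= 0 or n % num_devices:
        raise ValueError("sequence length must split evenly across devices")
    b = n // num_devices
    scale = 1.0 / math.sqrt(d)
    blocks = [range(p * b, (p + 1) * b) for p in range(num_devices)]
    out: Matrix = [[0.0] * dv for _ in range(n)]
    for p in range(num_devices):                 # in practice, devices run in parallel
        m = [-math.inf] * b                      # running max score per local query
        l = [0.0] * b                            # running softmax denominator
        acc = [[0.0] * dv for _ in range(b)]     # running unnormalized output
        for t in range(num_devices):             # one ring step per K/V block
            src = blocks[(p - t) % num_devices]
            for r, i in enumerate(blocks[p]):
                s = [_dot(q[i], k[j]) * scale for j in src]
                m_new = max(m[r], max(s))
                corr = math.exp(m[r] - m_new)    # rescale sums from earlier blocks
                w = [math.exp(x - m_new) for x in s]
                l[r] = l[r] * corr + sum(w)
                acc[r] = [a * corr + sum(wj * v[j][c] for wj, j in zip(w, src))
                          for c, a in enumerate(acc[r])]
                m[r] = m_new
        for r, i in enumerate(blocks[p]):
            out[i] = [a / l[r] for a in acc[r]]
    return out


# Toy check: 8 tokens, head dim 4, split across 4 simulated devices.
n, d = 8, 4
q = [[math.sin(1.0 + i * d + c) for c in range(d)] for i in range(n)]
k = [[math.cos(0.5 * (i * d + c)) for c in range(d)] for i in range(n)]
v = [[float((i + c) % 5) for c in range(d)] for i in range(n)]
err = max(abs(x - y)
          for ra, rb in zip(full_attention(q, k, v), ring_attention(q, k, v, 4))
          for x, y in zip(ra, rb))
print(err)  # differs from full attention only by floating-point rounding
```

The online-softmax rescaling (`corr`) is what makes the block order irrelevant: every device visits the K/V blocks in a different order, yet each ends up with the same result as full attention.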