Reducing costs for shuffle-heavy Apache Spark workloads with serverless storage for Amazon EMR Serverless
Big Data Blog
This article explains how serverless storage for Amazon EMR Serverless reduced total costs for shuffle-heavy Apache Spark workloads by about 26% in a TPC-DS benchmark, with savings reaching 85% for certain query patterns.
- Serverless storage eliminates local disk provisioning for Spark workloads
- Benchmarking on the TPC-DS dataset showed 26.65% total cost savings versus standard disks
- About 80% of queries benefit, with average savings of 47%, when shuffle data is externalized
- Requires Dynamic Resource Allocation to be enabled so that idle executors are released early
- Runtime increases by 37.94% due to external shuffle read/write latency
- Inverted triangle queries (high cardinality input, low cardinality output) benefit most
- Hourglass pattern queries with varying executor demand also see significant savings
- Rectangle pattern queries with sustained high parallelism see minimal cost benefits
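The two headline figures, lower cost alongside longer runtime, can look contradictory. A rough back-of-the-envelope sketch shows how they fit together, under the assumption (not stated in the article) that cost is approximately proportional to average billed resources multiplied by runtime. The function name `implied_resource_ratio` is hypothetical and exists only for this illustration.

```python
def implied_resource_ratio(cost_savings: float, runtime_increase: float) -> float:
    """Return the implied ratio of average billed resources (new / baseline).

    Assumption for illustration only: cost ~ average_resources * runtime.
    Real billing has several dimensions, so this is a simplification.
    """
    relative_cost = 1.0 - cost_savings          # cost after the change, baseline = 1.0
    relative_runtime = 1.0 + runtime_increase   # runtime after the change, baseline = 1.0
    return relative_cost / relative_runtime


# Figures quoted in the summary above.
ratio = implied_resource_ratio(cost_savings=0.2665, runtime_increase=0.3794)
print(f"Implied average resource footprint vs. baseline: {ratio:.2f}")
```

Under that simplifying assumption, the job would be holding roughly half the average resources of the baseline run, which is consistent with idle executors being released early even though the job takes longer overall.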
Serverless storage enables cost optimization for Spark workloads with dynamic resource patterns by decoupling shuffle data from compute, allowing immediate release of idle resources.
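As a minimal sketch of the Dynamic Resource Allocation requirement, the snippet below assembles standard Apache Spark dynamic allocation properties into a `--conf` string of the kind passed as Spark submit parameters. The helper name `build_spark_submit_conf` and the specific timeout and executor values are illustrative assumptions, not recommendations from the article. The setting that turns on serverless storage itself is not named in this summary, so it is deliberately left out.

```python
def build_spark_submit_conf(conf: dict) -> str:
    """Join Spark properties into a single '--conf key=value ...' string."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


# Standard Spark dynamic allocation properties; the values are illustrative only.
dra_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # Assumed value: how long an executor may sit idle before it is released.
    "spark.dynamicAllocation.executorIdleTimeout": "30s",
    # Assumed value: lower bound on the number of executors.
    "spark.dynamicAllocation.minExecutors": "1",
}

spark_submit_parameters = build_spark_submit_conf(dra_conf)
print(spark_submit_parameters)
```

A shorter idle timeout releases executors sooner, which matters most for the query shapes described above where executor demand drops sharply partway through the job.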