
Reducing costs for shuffle-heavy Apache Spark workloads with serverless storage for Amazon EMR Serverless

Big Data Blog



This article explains how serverless storage for Amazon EMR Serverless reduces total costs for shuffle-heavy Apache Spark workloads by about 26% in a TPC-DS benchmark, with savings reaching 85% for certain query patterns.

  • Serverless storage eliminates local disk provisioning for Spark workloads
  • Benchmarking on TPC-DS dataset showed 26.65% total cost savings versus standard disks
  • About 80% of queries benefit, with average savings of 47%, when shuffle data is externalized
  • Requires Dynamic Resource Allocation enabled to release idle executors early
  • Runtime increases 37.94% due to external shuffle read/write latency
  • Inverted triangle queries (high-cardinality input, low-cardinality output) benefit most
  • Hourglass pattern queries with varying executor demand also see significant savings
  • Rectangle pattern queries with sustained high parallelism see minimal cost benefits
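
The three query shapes named above can be made concrete with a toy model. The sketch below is only an illustration: the per-stage executor counts are invented, not taken from the benchmark, and it makes the simplifying assumption that without externalized shuffle a job holds its peak executor count for the whole run, while with externalized shuffle and Dynamic Resource Allocation it holds only what each stage needs. It also ignores the runtime increase noted in the list.

```python
# Toy model: executor demand per stage for three hypothetical query shapes.
# All numbers are invented for illustration; none come from the benchmark.
PROFILES = {
    "inverted_triangle": [100, 60, 30, 10, 2],     # wide input, narrow output
    "hourglass": [100, 20, 5, 20, 100],            # wide, narrow, wide again
    "rectangle": [100, 100, 100, 100, 100],        # sustained high parallelism
}


def held_executor_steps(demand):
    """Assumed baseline: the peak executor count is held for every stage."""
    return max(demand) * len(demand)


def released_executor_steps(demand):
    """Assumed best case: idle executors are released, so usage tracks demand."""
    return sum(demand)


def savings_fraction(demand):
    """Fraction of executor-steps avoided when idle executors are released."""
    return 1 - released_executor_steps(demand) / held_executor_steps(demand)


for name, demand in PROFILES.items():
    print(f"{name}: {savings_fraction(demand):.0%} fewer executor-steps")
```

Under these assumptions the rectangle profile saves nothing because demand never drops below the peak, which matches the qualitative claim that sustained high parallelism sees minimal benefit.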

Serverless storage enables cost optimization for Spark workloads with dynamic resource patterns by decoupling shuffle data from compute, allowing immediate release of idle resources.
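
A rough way to reconcile the two headline figures (lower cost despite longer runtime) is a back-of-envelope calculation. The sketch below assumes that cost is proportional to billed resources multiplied by time at an unchanged unit rate, and that any separate charge for the storage itself can be ignored; both are simplifying assumptions, not statements from the article. The function name is hypothetical.

```python
# Figures quoted in the summary above.
COST_SAVINGS = 0.2665      # total cost reduction versus standard disks
RUNTIME_INCREASE = 0.3794  # runtime increase with externalized shuffle


def implied_avg_resource_ratio(cost_savings, runtime_increase):
    """Average billed resources relative to baseline.

    Assumes cost = unit_rate * average_resources * runtime, with the
    same unit rate in both configurations (an assumption of this sketch).
    """
    return (1 - cost_savings) / (1 + runtime_increase)


ratio = implied_avg_resource_ratio(COST_SAVINGS, RUNTIME_INCREASE)
print(f"Implied average resource footprint: {ratio:.1%} of baseline")
```

Under that assumption, the job would have to hold only a little over half of the baseline resources on average for the quoted savings to coexist with the quoted slowdown.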





Related articles

  • Dec 2, 2025: Amazon EMR Serverless eliminates local storage provisioning for Apache Spark workloads
  • Jan 6, 2026: Amazon EMR Serverless eliminates local storage provisioning, reducing data processing costs by up to 20%
  • Nov 21, 2025: Amazon EMR Serverless now supports Apache Spark 4.0.1 (preview)
  • Dec 10, 2024: Run Apache Spark Structured Streaming jobs at scale on Amazon EMR Serverless
