Big Data Blog
This article announces the general availability of Apache Spark 4.0 on Amazon EMR, introducing major improvements for data processing, semi-structured data handling, and streaming workloads.
- Spark Connect enables remote PySpark development from IDEs without local Spark installation
- VARIANT data type natively supports semi-structured JSON without upfront schema definition
- Apache Iceberg V3 integration enables efficient semi-structured storage with schema evolution
- SQL scripting adds procedural logic (variables, conditionals, loops) directly in SQL
- Python Data Source API allows building custom connectors entirely in Python
- Queryable state for streaming enables live state inspection without stopping jobs
- EMR Serverless runs Spark workloads up to 4.5× faster than open-source Apache Spark
- EMR-spark-8.0 includes Python 3.11, Java 17, and simplified patch management
- Available across EMR on EC2, EMR on EKS, and EMR Serverless deployment options
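As a rough illustration of the Python Data Source API mentioned in the list above, the sketch below defines a toy connector that emits a few rows of squares. The class names, the `squares` format name, and the `count` option are hypothetical and not taken from the announcement. The import fallback is an assumption added only so the reader logic can be tried without a Spark installation.

```python
try:
    # Available in PySpark 4.0 and later.
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Fallback so the plain-Python reader logic can run without PySpark.
    DataSource = DataSourceReader = object


class SquaresReader(DataSourceReader):
    """Hypothetical reader that yields (n, n*n) rows."""

    def __init__(self, count: int):
        self.count = count

    def read(self, partition):
        # Each yielded tuple becomes one row matching the declared schema.
        for n in range(self.count):
            yield (n, n * n)


class SquaresDataSource(DataSource):
    """Hypothetical data source exposed under the format name 'squares'."""

    @classmethod
    def name(cls):
        return "squares"

    def schema(self):
        # DDL-style schema string for the rows produced by the reader.
        return "n INT, square INT"

    def reader(self, schema):
        # self.options is populated by Spark when the source is loaded.
        return SquaresReader(int(self.options.get("count", 3)))
```

With a Spark 4.0 session, a source like this would typically be registered with `spark.dataSource.register(SquaresDataSource)` and read with `spark.read.format("squares").option("count", 5).load()`.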
Spark 4.0 on Amazon EMR simplifies data processing by reducing schema complexity, enabling interactive development at production scale, and providing better observability for streaming workloads.
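To make the schema and interactive-development points more concrete, here is a minimal sketch that combines Spark Connect with the VARIANT type. The host name, the default port 15002, and the sample JSON payload are placeholders, and the helper names are hypothetical. How a Spark Connect endpoint is actually exposed on a given EMR deployment is not covered here.

```python
# Query that parses a JSON string into a VARIANT value and extracts typed
# fields from it, without declaring a schema for the payload up front.
VARIANT_QUERY = """
SELECT variant_get(payload, '$.device.id', 'string') AS device_id,
       variant_get(payload, '$.reading', 'double') AS reading
FROM (
  SELECT parse_json('{"device": {"id": "a1"}, "reading": 21.5}') AS payload
)
"""


def connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect URL of the form sc://host:port."""
    return f"sc://{host}:{port}"


def run_variant_demo(host: str):
    """Connect to a remote Spark Connect server and run the VARIANT query."""
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(connect_url(host)).getOrCreate()
    return spark.sql(VARIANT_QUERY).collect()
```

Calling `run_variant_demo("spark.example.internal")` from an IDE would run the query on the remote cluster, assuming a Spark Connect server is reachable at that placeholder address.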