Beyond JSON blobs: Implementing the VARIANT data type in Apache Iceberg V3
Big Data Blog
This article introduces Apache Iceberg V3's VARIANT data type for efficiently storing and querying semi-structured JSON data in data lakes, demonstrated on Amazon EMR Serverless.
- VARIANT uses binary encoding and columnar shredding instead of storing JSON as text strings
- Queries access only needed fields without deserializing entire JSON documents
- Reduces both storage footprint and query processing time through efficient compression
- Create Iceberg V3 tables with VARIANT columns using format-version=3
- Use parse_json() to convert JSON strings to binary VARIANT format at write time
- Extract fields with variant_get() function using JSON path syntax
- Supports nested object access, array indexing, and type-specific extraction
- Ideal for IoT sensors, clickstream analytics, and log data with evolving schemas
- Amazon EMR Serverless 8.0 includes native Iceberg V3 and VARIANT support
- Part 1 covers basics; Part 2 will benchmark performance against string storage
VARIANT bridges JSON flexibility with columnar performance, enabling efficient semi-structured data management without predefined schemas.
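The workflow in the list above (create a V3 table, parse JSON at write time, extract fields by path) can be sketched as three Spark SQL statements wrapped in Python. This is a minimal sketch rather than the article's own code: the catalog, table, column, and JSON field names (`demo.iot.sensor_events`, `payload`, `device.id`, `readings`) are hypothetical, and it assumes a Spark session already configured with an Iceberg catalog that supports format version 3 and exposes `parse_json` and `variant_get` as Spark SQL functions.

```python
# Sketch only: catalog, table, column, and JSON field names are hypothetical.

# 1. Create an Iceberg table with a VARIANT column; format-version 3 is required.
CREATE_TABLE_SQL = """
CREATE TABLE demo.iot.sensor_events (
    event_id BIGINT,
    payload  VARIANT
)
USING iceberg
TBLPROPERTIES ('format-version' = '3')
"""

# 2. Convert a JSON string to binary VARIANT at write time with parse_json().
INSERT_SQL = """
INSERT INTO demo.iot.sensor_events
SELECT 1L, parse_json('{"device": {"id": "d-17"}, "readings": [21.5, 22.1]}')
"""

# 3. Extract typed fields with variant_get(): a nested object field and an
#    array element, each cast to an explicit target type.
QUERY_SQL = """
SELECT
    variant_get(payload, '$.device.id', 'string')   AS device_id,
    variant_get(payload, '$.readings[0]', 'double') AS first_reading
FROM demo.iot.sensor_events
"""


def run_demo(spark):
    """Run the three statements on an existing SparkSession; return the query rows."""
    spark.sql(CREATE_TABLE_SQL)
    spark.sql(INSERT_SQL)
    return spark.sql(QUERY_SQL).collect()
```

Calling `run_demo(spark)` from a job or notebook would execute the statements in order. The third argument to `variant_get` names the type to extract, which is how the type-specific extraction mentioned above is expressed.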
Related articles
- Apr 20, 2026: Building unified data pipelines with Apache Iceberg and Apache Flink
- Apr 3, 2024: Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake
- May 20, 2024: Understanding Apache Iceberg on AWS with the new technical guide
- Nov 14, 2024: Expand data access through Apache Iceberg using Delta Lake UniForm on AWS