Beyond JSON blobs: Implementing the VARIANT data type in Apache Iceberg V3

Big Data Blog

This article introduces Apache Iceberg V3's VARIANT data type for efficiently storing and querying semi-structured JSON data in data lakes, demonstrated on Amazon EMR Serverless.

VARIANT uses binary encoding and columnar shredding instead of storing JSON as text strings
Queries access only needed fields without deserializing entire JSON documents
Reduces storage footprint through efficient compression and shortens query processing time
Create Iceberg V3 tables with VARIANT columns using format-version=3
Use parse_json() to convert JSON strings to binary VARIANT format at write time
Extract fields with variant_get() function using JSON path syntax
Supports nested object access, array indexing, and type-specific extraction
Ideal for IoT sensors, clickstream analytics, and log data with evolving schemas
Amazon EMR Serverless 8.0 includes native Iceberg V3 and VARIANT support
Part 1 covers basics; Part 2 will benchmark performance against string storage
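
The write and read steps listed above can be sketched roughly as follows. This is a minimal illustration, not the article's own code: the table name `demo.db.sensor_events`, the column names, and the sample document are hypothetical, and the `run_demo` helper assumes a Spark session already configured with an Iceberg catalog on a runtime that supports Iceberg V3 and VARIANT. The functions only build Spark SQL strings, so they can be inspected without a cluster.

```python
import json

# Hypothetical catalog.namespace.table name, used for illustration only.
TABLE = "demo.db.sensor_events"

# Hypothetical IoT-style document with a nested object and an array.
SAMPLE = {"device": {"id": "sensor-42"}, "readings": [{"temp": 21.5}, {"temp": 22.0}]}


def create_table_sql(table: str = TABLE) -> str:
    """DDL for an Iceberg table with a VARIANT column, pinned to format version 3."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "  event_id BIGINT,\n"
        "  payload  VARIANT\n"
        ") USING iceberg\n"
        "TBLPROPERTIES ('format-version' = '3')"
    )


def insert_sql(table: str = TABLE) -> str:
    """INSERT that converts a JSON string to VARIANT at write time.

    :event_id and :doc are named parameter markers, so the JSON text is
    bound as a value instead of being spliced into the SQL string.
    """
    return f"INSERT INTO {table} SELECT :event_id, parse_json(:doc)"


def extract_sql(table: str = TABLE) -> str:
    """Query showing nested object access, array indexing, and typed extraction."""
    return (
        "SELECT\n"
        "  variant_get(payload, '$.device.id', 'string') AS device_id,\n"
        "  variant_get(payload, '$.readings[0].temp', 'double') AS first_temp\n"
        f"FROM {table}"
    )


def run_demo(spark, table: str = TABLE):
    """Run the three statements on an existing SparkSession (assumed to be
    configured for Iceberg V3; not exercised outside such an environment)."""
    spark.sql(create_table_sql(table))
    spark.sql(insert_sql(table), args={"event_id": 1, "doc": json.dumps(SAMPLE)})
    return spark.sql(extract_sql(table)).collect()


if __name__ == "__main__":
    for statement in (create_table_sql(), insert_sql(), extract_sql()):
        print(statement, end="\n\n")
```

Passing the JSON text through a parameter marker rather than string formatting avoids quoting problems when documents contain quotes or backslashes; the third argument to `variant_get` names the type the extracted value should be cast to.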

VARIANT bridges JSON flexibility with columnar performance, enabling efficient semi-structured data management without predefined schemas.



Apr 20, 2026

Building unified data pipelines with Apache Iceberg and Apache Flink

Apr 3, 2024

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

May 20, 2024

Understanding Apache Iceberg on AWS with the new technical guide

Nov 14, 2024

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS


