Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

Big Data Blog

This article demonstrates how to optimize data layout and query performance by using partitioning and bucketing techniques with Amazon Athena and AWS Glue. It covers a use case where analysts need to run queries on a large public dataset (NOAA Integrated Surface Database) and complete them within 10 seconds while optimizing costs.

Specifically, the article covers:

Creating a baseline table and evaluating its query performance
Optimizing data layout using Athena CTAS (Create Table As Select) with partitioning and bucketing
Optimizing data layout using AWS Glue ETL with partitioning and Spark-based bucketing
Optimizing data layout for Apache Iceberg tables with hidden partitioning and bucketing
Comparing the query performance and data scan sizes across different table configurations
Conclusion: Bucketing can reduce both query latency and the amount of data scanned, which further lowers cost
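
As a rough illustration of the Athena CTAS step listed above, the sketch below assembles a CTAS statement that writes a partitioned and bucketed Parquet table. The table names, column names, bucket count, and S3 location are hypothetical placeholders rather than values from the article; `partitioned_by`, `bucketed_by`, and `bucket_count` are the Athena CTAS table properties involved. The helper only builds the SQL text and does not submit it to Athena.

```python
def build_ctas_query(target_table, source_table, columns,
                     partition_cols, bucket_cols, bucket_count, location):
    """Build an Athena CTAS statement with partitioning and bucketing."""
    # A column cannot be both a partition column and a bucket column.
    overlap = set(partition_cols) & set(bucket_cols)
    if overlap:
        raise ValueError(f"columns used for both partitioning and bucketing: {sorted(overlap)}")

    def sql_array(names):
        # Render a Python list as an Athena ARRAY['a', 'b'] literal.
        return "ARRAY[" + ", ".join(f"'{name}'" for name in names) + "]"

    # Athena CTAS expects partition columns last in the SELECT list.
    select_list = ", ".join(list(columns) + list(partition_cols))

    properties = [
        "format = 'PARQUET'",
        f"external_location = '{location}'",
        f"partitioned_by = {sql_array(partition_cols)}",
        f"bucketed_by = {sql_array(bucket_cols)}",
        f"bucket_count = {bucket_count}",
    ]
    props = ",\n  ".join(properties)

    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n  {props}\n)\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source_table}"
    )


# Hypothetical names, chosen only for illustration.
sql = build_ctas_query(
    target_table="noaa_bucketed",
    source_table="noaa_baseline",
    columns=["station", "temperature"],
    partition_cols=["year"],
    bucket_cols=["station"],
    bucket_count=16,
    location="s3://example-bucket/noaa_bucketed/",
)
print(sql)
```

Bucketing on a column that queries filter by exact value (here the assumed `station` column) is what lets the engine read only the matching bucket files instead of every file in a partition.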


Aug 8, 2024

Query AWS Glue Data Catalog views using Amazon Athena and Amazon Redshift

Dec 3, 2024

Introducing AWS Glue Data Catalog automation for table statistics collection for improved query performance on Amazon Redshift and Amazon Athena

Aug 8, 2024

AWS Glue Data Catalog views are now GA with Amazon Athena and Amazon Redshift

Dec 19, 2024

AWS Glue Data Catalog offers advanced automatic optimization for Apache Iceberg tables



Related articles