Measure performance of AWS Glue Data Quality for ETL pipelines

Big Data Blog



This article presents benchmark results for running data quality rulesets of increasing complexity over a predefined test dataset with AWS Glue Data Quality. It shows how the service reports the runtime, the resources used (measured in DPUs), and the cost of running data quality checks as part of an ETL pipeline.
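To make the relationship between runtime, DPUs, and cost concrete, here is a minimal Python sketch of the arithmetic. It assumes a rate of 0.44 USD per DPU-hour and per-second billing with a one-minute minimum; both are assumptions that vary by Region and Glue version, so check the current AWS Glue pricing page. The function name and the sample inputs are hypothetical and are not figures from the article's benchmark.

```python
def estimate_glue_cost(
    dpus: float,
    runtime_seconds: float,
    price_per_dpu_hour: float = 0.44,      # assumed rate, varies by Region
    minimum_billed_seconds: float = 60.0,  # assumed billing minimum
) -> float:
    """Estimate the cost in USD of one Glue job run from DPUs and runtime."""
    # Apply the assumed minimum billing duration.
    billed_seconds = max(runtime_seconds, minimum_billed_seconds)
    # Convert to DPU-hours, the unit the job is billed in.
    dpu_hours = dpus * billed_seconds / 3600.0
    return dpu_hours * price_per_dpu_hour


# Hypothetical run: 10 DPUs for 120 seconds.
example_cost = estimate_glue_cost(dpus=10, runtime_seconds=120)
print(f"Estimated cost: ${example_cost:.4f}")
```

Under these assumptions the run consumes one third of a DPU-hour, so the estimate scales linearly with both the number of DPUs and the job duration.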

Specifically, the article covers:

  • The test dataset details (104 columns, 1 million rows in Parquet format)
  • Defining different data quality rulesets with varying complexity (from 1 rule to 400 rules)
  • Creating AWS Glue ETL jobs with the rulesets and running them
  • Reviewing performance metrics such as job duration, DPU usage, and estimated cost
  • Analyzing the cost breakdown using AWS Cost Explorer with user-defined tags
  • Concluding that AWS Glue Data Quality scales well as ruleset complexity increases while keeping costs low
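As a rough illustration of how rulesets of varying size might be produced for such a benchmark, the Python sketch below generates Data Quality Definition Language (DQDL) strings with a chosen number of rules. The column names (`col_1` to `col_104`), the rule mix, and the thresholds are hypothetical placeholders. The article's actual rulesets are not reproduced here.

```python
from itertools import product

# DQDL rule types used as templates. Thresholds are illustrative only.
RULE_TEMPLATES = [
    'ColumnExists "{col}"',
    'IsComplete "{col}"',
    'Completeness "{col}" > 0.5',
    'ColumnLength "{col}" <= 255',
]


def build_ruleset(columns, rule_count):
    """Return a DQDL ruleset string containing exactly `rule_count` rules."""
    # Every template/column combination is one distinct candidate rule.
    candidates = [t.format(col=c) for t, c in product(RULE_TEMPLATES, columns)]
    if not 1 <= rule_count <= len(candidates):
        raise ValueError(f"rule_count must be between 1 and {len(candidates)}")
    rules = candidates[:rule_count]
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


# Hypothetical column names for a 104-column dataset.
columns = [f"col_{i}" for i in range(1, 105)]

# Smallest and largest ruleset sizes mentioned in the summary.
rulesets = {n: build_ruleset(columns, n) for n in (1, 400)}
print(rulesets[1])
```

In a Glue ETL job, a string like this would typically be supplied as the ruleset of the data quality evaluation step. Column-type mismatches (for example, `ColumnLength` on a non-string column) would need to be handled for a real dataset.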




Related articles

  • May 23, 2024: Get started with AWS Glue Data Quality dynamic rules for ETL pipelines
  • May 10, 2024: Troubleshooting AWS Glue ETL Jobs using Amazon CloudWatch Logs Insights enhanced queries
  • Mar 26, 2026: Build AWS Glue Data Quality pipeline using Terraform
  • Jun 13, 2025: From raw to refined: building a data quality pipeline with AWS Glue and Amazon S3 Tables
