Measure performance of AWS Glue Data Quality for ETL pipelines
Big Data Blog
This article provides benchmark results for running increasingly complex data quality rulesets over a predefined test dataset using AWS Glue Data Quality. It shows how AWS Glue Data Quality provides information about the runtime, resources used (measured in DPUs), and cost for running the data quality checks as part of an ETL pipeline.
Specifically, the article covers:
- The test dataset details (104 columns, 1 million rows in Parquet format)
- Defining different data quality rulesets with varying complexity (from 1 rule to 400 rules)
- Creating AWS Glue ETL jobs with the rulesets and running them
- Reviewing performance metrics such as job duration, DPU usage, and estimated cost
- Analyzing the cost breakdown using AWS Cost Explorer with user-defined tags
- The conclusion that AWS Glue Data Quality scales well as ruleset complexity increases while keeping costs low
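To make the benchmark setup more concrete, the following is a minimal Python sketch of how rulesets of increasing size might be generated in DQDL (the Data Quality Definition Language used by AWS Glue Data Quality) and how a job's cost could be roughly estimated from its duration and DPU count. It is not the method used in the original article. The column names, the choice of rule types, the thresholds, and the price of 0.44 USD per DPU-hour are all assumptions made for illustration, and the estimate ignores details such as minimum billing duration.

```python
# Hypothetical rule templates; the rule types are DQDL rule types, but the
# selection and thresholds are illustrative assumptions, not the article's rules.
RULE_TEMPLATES = [
    'IsComplete "{col}"',
    'ColumnExists "{col}"',
    'Completeness "{col}" > 0.8',
    'Uniqueness "{col}" > 0.1',
]


def build_ruleset(columns, n_rules):
    """Return a DQDL ruleset string containing exactly n_rules rules."""
    # Every template applied to every column, grouped by template.
    candidates = [t.format(col=c) for t in RULE_TEMPLATES for c in columns]
    if not 1 <= n_rules <= len(candidates):
        raise ValueError(f"n_rules must be between 1 and {len(candidates)}")
    rules = candidates[:n_rules]
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


def estimate_cost(duration_seconds, dpus, price_per_dpu_hour=0.44):
    """Rough cost in USD: DPUs x hours x price (the price is an assumed value)."""
    return dpus * (duration_seconds / 3600) * price_per_dpu_hour


if __name__ == "__main__":
    # Placeholder column names standing in for the 104-column test dataset.
    columns = [f"col_{i}" for i in range(104)]
    for size in (1, 100, 400):
        ruleset = build_ruleset(columns, size)
        print(size, "rules,", len(ruleset), "characters")
    # Example: a hypothetical 5-minute run on 10 DPUs.
    print(f"Estimated cost: ${estimate_cost(300, 10):.4f}")
```

The generated string could be pasted into the ruleset definition of a data quality transform in an AWS Glue ETL job. Actual pricing varies by Region, so the default price should be replaced with the current rate.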