Capture data lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker
Big Data Blog
This article shows how to capture data lineage from dbt, Apache Airflow, and Apache Spark in Amazon SageMaker using OpenLineage.
- SageMaker now provides centralized lineage metadata tracking across data assets
- An OpenLineage HTTP Proxy solution enables streaming lineage events from different data processing tools
- The proxy architecture uses API Gateway, SQS, and Lambda to route and process lineage events
- Demonstrated configuration for OpenLineage integration with:
  - AWS Glue 4.0 Spark jobs
  - dbt data pipelines
  - Apache Airflow (Amazon MWAA)
- Helps improve data governance and trust by providing transparent data origin and transformation tracking
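The proxy's last hop (SQS to Lambda) can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: it assumes the Lambda function forwards each event through the Amazon DataZone `PostLineageEvent` API via boto3, which this summary does not confirm. The `DOMAIN_ID` environment variable and the helper names are hypothetical.

```python
import json
import os
from typing import Any, Callable, Dict, Iterable, List

# Fields checked here before forwarding; a stricter validator could be used instead.
REQUIRED_FIELDS = ("eventType", "eventTime", "run", "job")


def extract_lineage_events(sqs_records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Parse SQS record bodies and keep only those that look like OpenLineage run events."""
    events = []
    for record in sqs_records:
        try:
            event = json.loads(record["body"])
        except (KeyError, TypeError, json.JSONDecodeError):
            continue  # skip records without a JSON body
        if isinstance(event, dict) and all(field in event for field in REQUIRED_FIELDS):
            events.append(event)
    return events


def forward_events(events: List[Dict[str, Any]], post_event: Callable[[str], Any]) -> int:
    """Send each event through the supplied callable and return how many were sent."""
    for event in events:
        post_event(json.dumps(event))
    return len(events)


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Hypothetical Lambda entry point for an SQS trigger."""
    import boto3  # imported lazily so the helpers above can be tested without AWS

    client = boto3.client("datazone")
    domain_id = os.environ["DOMAIN_ID"]  # hypothetical variable holding the domain ID

    def post(payload: str) -> Any:
        # Assumption: lineage is ingested through DataZone's PostLineageEvent API.
        return client.post_lineage_event(
            domainIdentifier=domain_id, event=payload.encode("utf-8")
        )

    events = extract_lineage_events(event.get("Records", []))
    return {"forwarded": forward_events(events, post)}
```

Keeping the parsing and forwarding steps separate from the boto3 call makes the filtering logic testable locally with a stand-in callable.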
The solution simplifies data asset governance by enabling centralized lineage tracking across different data processing tools and platforms.
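On the producer side, any tool that can issue an HTTP POST could send lineage to such a proxy. The sketch below builds a minimal OpenLineage run event and wraps it in a request object. The producer URI, the spec version in `SCHEMA_URL`, and the proxy URL in the usage note are assumptions for illustration, so check them against the OpenLineage spec and your own deployment.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Optional, Tuple
from urllib.request import Request

PRODUCER = "https://example.com/lineage-sketch"  # hypothetical producer URI
# Assumed spec version; verify against the OpenLineage release you target.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"

Dataset = Tuple[str, str]  # (namespace, name)


def build_run_event(
    event_type: str,
    job_namespace: str,
    job_name: str,
    inputs: Iterable[Dataset] = (),
    outputs: Iterable[Dataset] = (),
    run_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a minimal OpenLineage run event as a plain dictionary."""
    return {
        "eventType": event_type,  # for example START, COMPLETE, or FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


def build_proxy_request(proxy_url: str, event: Dict[str, Any]) -> Request:
    """Wrap the event in an HTTP POST request aimed at the proxy endpoint."""
    return Request(
        proxy_url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result of `build_proxy_request` to `urllib.request.urlopen` would send the event. In practice the OpenLineage integrations for Spark, dbt, and Airflow emit these events themselves once configured with the proxy's URL.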
Related articles
Jul 30, 2025
Automate data lineage in Amazon SageMaker using AWS Glue Crawlers supported data sources
May 21, 2026
Capture data lineage of Amazon EMR Spark jobs into Amazon SageMaker Unified Studio
Mar 17, 2026
Amazon SageMaker Unified Studio supports aggregated view of data lineage
Oct 13, 2025
Visualize data lineage using Amazon SageMaker Catalog for Amazon EMR, AWS Glue, and Amazon Redshift