Capture data lineage of Amazon EMR Spark jobs into Amazon SageMaker Unified Studio
Big Data Blog
This article demonstrates how to capture and visualize data lineage from Apache Spark jobs running on Amazon EMR into Amazon SageMaker Unified Studio using native OpenLineage support.
- Amazon EMR 7.11 and later releases include native OpenLineage support for automatic lineage capture
- OpenLineage metadata flows directly to Amazon SageMaker Catalog without customization
- Solution uses Apache Iceberg tables with AWS Glue Data Catalog for metadata storage
- Step-by-step walkthrough includes HR analytics pipeline with two Spark transformation jobs
- SageMaker Unified Studio visualizes complete lineage graph from source CSV files to final tables
- Column-level lineage tracking shows data transformations at granular detail
- CloudFormation template automates deployment of EMR cluster with OpenLineage pre-configured
- Enables data governance, compliance audits, and impact analysis across data pipelines
This integration provides automated end-to-end data lineage visibility, strengthening governance while maintaining agility in analytics workflows.
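The summary above says the CloudFormation template deploys the EMR cluster with OpenLineage pre-configured, so no manual setup should be needed. For readers who want to see roughly what such a configuration involves, the following is a minimal sketch that assembles Spark properties for the OpenLineage listener. The property names follow the OpenLineage Spark integration documentation as I understand it; the domain ID, the namespace value, and both helper function names are hypothetical placeholders, not values taken from the article.

```python
from typing import Dict, List


def openlineage_spark_conf(domain_id: str, namespace: str = "emr-hr-analytics") -> Dict[str, str]:
    """Build Spark properties that enable the OpenLineage listener.

    Assumptions (not from the article): the domain ID and namespace are
    placeholders, and the transport type targets the DataZone-backed
    catalog behind SageMaker Unified Studio.
    """
    return {
        # Registers the OpenLineage listener with the Spark driver.
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        # Sends lineage events through the Amazon DataZone API transport.
        "spark.openlineage.transport.type": "amazon_datazone_api",
        # Hypothetical domain identifier; replace with a real domain ID.
        "spark.openlineage.transport.domainId": domain_id,
        # Logical namespace used to group jobs in the lineage graph.
        "spark.openlineage.namespace": namespace,
    }


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a property dict into repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = openlineage_spark_conf("dzd_example123")
    print(" ".join(to_spark_submit_args(conf)))
```

The resulting arguments could be passed to `spark-submit` ahead of the job script. On a cluster created from the template described above, these settings are presumably already in place, so this sketch is mainly useful for understanding or auditing the configuration.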
Related articles
- Mar 17, 2026: Amazon SageMaker Unified Studio supports aggregated view of data lineage
- Jun 24, 2025: Capture data lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker
- Oct 13, 2025: Visualize data lineage using Amazon SageMaker Catalog for Amazon EMR, AWS Glue, and Amazon Redshift
- Feb 4, 2026: Apache Spark lineage now available in Amazon SageMaker Unified Studio for IDC based domains