Synchronize data lakes with CDC-based UPSERT using open table format, AWS Glue, and Amazon MSK
Big Data Blog
This article discusses a solution for synchronizing data lakes with change data capture (CDC)-based UPSERT using an open table format, AWS Glue, and Amazon MSK. It covers the following:
- Capturing data changes from a MySQL database using the Debezium connector and streaming them to Amazon MSK
- Streaming data from MSK to Amazon S3 using the Confluent S3 Sink Connector
- Processing raw CDC data from S3 using an AWS Glue ETL job and writing it to the data lake in an open table format (Delta Lake)
- Querying the Delta Lake table using Amazon Athena for data analysis
- Demonstrating how insert, update, and delete operations on the source MySQL data are applied to the Delta Lake table
The solution enables near real-time synchronization of data lakes with database changes, maintaining data consistency and integrity for analytics use cases.
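The insert, update, and delete handling described above can be illustrated with a minimal sketch. It is not the article's Glue job: it uses a plain Python dictionary as a stand-in for the Delta Lake table and assumes each change event has already been unwrapped to Debezium's envelope fields (`op`, `before`, `after`, `ts_ms`). The function name `apply_cdc_events`, the key column `id`, and the sample columns are hypothetical.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_cdc_events(
    table: Dict[Any, Row], events: Iterable[dict], key: str = "id"
) -> Dict[Any, Row]:
    """Apply Debezium-style change events to an in-memory table.

    `table` maps primary-key values to rows and stands in for the
    target table (an assumption for illustration only).
    """
    # Process events in timestamp order so the latest change wins.
    # sorted() is stable, so events with equal ts_ms keep arrival order.
    for event in sorted(events, key=lambda e: e["ts_ms"]):
        op = event["op"]
        if op in ("c", "r", "u"):
            # create, snapshot read, update: upsert the "after" image
            row = event["after"]
            table[row[key]] = row
        elif op == "d":
            # delete: the key is only available in the "before" image
            table.pop(event["before"][key], None)
        else:
            raise ValueError(f"unsupported op: {op!r}")
    return table


# Hypothetical sample events: two inserts, one update, one delete.
events = [
    {"op": "c", "ts_ms": 1, "before": None,
     "after": {"id": 1, "name": "alice", "city": "Seattle"}},
    {"op": "c", "ts_ms": 2, "before": None,
     "after": {"id": 2, "name": "bob", "city": "Austin"}},
    {"op": "u", "ts_ms": 3,
     "before": {"id": 1, "name": "alice", "city": "Seattle"},
     "after": {"id": 1, "name": "alice", "city": "Portland"}},
    {"op": "d", "ts_ms": 4,
     "before": {"id": 2, "name": "bob", "city": "Austin"}, "after": None},
]
result = apply_cdc_events({}, events)
```

With a real Delta Lake table, the same upsert and delete branches would typically be expressed as a single merge keyed on the primary key rather than a row-by-row loop.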