
Use Amazon MSK Connect and Iceberg Kafka Connect to build a real-time data lake

Big Data Blog



This article demonstrates how to build a real-time data lake using Amazon MSK Connect and Iceberg Kafka Connect for continuous data synchronization from transactional databases to Apache Iceberg tables on Amazon S3.

  • Captures change data capture (CDC) events from Amazon RDS for MySQL using the Debezium connector
  • Streams data through Amazon MSK to Iceberg tables with exactly-once delivery
  • Supports single-table and multi-table synchronization modes
  • Automatically handles schema evolution and field changes
  • Achieves a throughput of approximately 10,000 records per second per MSK Connect Unit (MCU)
  • Requires custom Kafka Connect plugins built from open source code
  • Integrates with AWS Glue Data Catalog for table management
  • Includes compaction workflows to optimize query performance
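The single-table and multi-table modes listed above could be expressed as two variants of one sink configuration. The following is a minimal sketch, not the article's actual configuration: the property names follow the Apache Iceberg Kafka Connect sink documentation as best understood here and should be checked against the plugin version you build, and the helper name, topic names, table names, and bucket are hypothetical.

```python
import json


def iceberg_sink_config(topics, warehouse, table=None, route_field=None):
    """Build an Iceberg sink connector config as a plain dict.

    Pass `table` for single-table mode, or `route_field` for
    multi-table mode (records are routed by the value of that field).
    Property names are assumptions based on the Apache Iceberg
    Kafka Connect sink docs; verify them for your plugin build.
    """
    if (table is None) == (route_field is None):
        raise ValueError("set exactly one of table or route_field")

    config = {
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "topics": ",".join(topics),
        # AWS Glue Data Catalog as the Iceberg catalog, data files on S3.
        "iceberg.catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "iceberg.catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "iceberg.catalog.warehouse": warehouse,
        # Create missing tables and apply field changes automatically.
        "iceberg.tables.auto-create-enabled": "true",
        "iceberg.tables.evolve-schema-enabled": "true",
    }
    if table is not None:
        config["iceberg.tables"] = table
    else:
        config["iceberg.tables.dynamic-enabled"] = "true"
        config["iceberg.tables.route-field"] = route_field
    return config


if __name__ == "__main__":
    # Hypothetical topic, table, and bucket names for illustration only.
    single = iceberg_sink_config(
        ["cdc.shop.orders"], "s3://example-bucket/warehouse", table="shop.orders"
    )
    print(json.dumps(single, indent=2))
```

The resulting dict can be serialized and supplied as the connector configuration when creating the connector; the Debezium source connector needs its own, separate configuration.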

This solution provides a fully managed approach with low operational complexity for real-time data ingestion into data lakes, and it is suited to high-volume transactional workloads.
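For a high-volume workload, the quoted figure of roughly 10,000 records per second per MCU gives a starting point for sizing. The sketch below is a back-of-the-envelope estimate only: the function name and the 30% headroom default are assumptions, and real throughput depends on record size, schema, and commit interval.

```python
import math

# Approximate per-MCU throughput quoted in the summary above.
RECORDS_PER_SEC_PER_MCU = 10_000


def estimate_mcus(peak_records_per_sec, headroom=0.3):
    """Estimate MCUs needed for a peak rate, padded by a headroom fraction."""
    if peak_records_per_sec < 0 or headroom < 0:
        raise ValueError("rate and headroom must be non-negative")
    required = peak_records_per_sec * (1 + headroom) / RECORDS_PER_SEC_PER_MCU
    # Always provision at least one MCU, and round up partial units.
    return max(1, math.ceil(required))


if __name__ == "__main__":
    print(estimate_mcus(45_000))
```

With the default headroom, a peak of 45,000 records per second works out to 6 MCUs; treat the result as a first guess to validate with a load test.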





Related articles

  • Apr 3, 2024: Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake
  • Jul 14, 2025: Build real-time data lakes with Snowflake and Amazon S3 Tables
  • Jun 20, 2025: Stream data from Amazon MSK to Apache Iceberg tables in Amazon S3 and Amazon S3 Tables using Amazon Data Firehose
  • Mar 13, 2025: Build a managed Apache Iceberg data lake using Starburst and Amazon S3 Tables
