Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics
This article describes Orca Security's implementation of a petabyte-scale transactional data lake using Apache Iceberg, Amazon S3, and AWS Analytics services.
- Orca migrated from data silos to a centralized data lake for scalability and advanced analytics
- Apache Iceberg chosen for ACID guarantees, schema evolution, and engine-agnostic design
- Architecture uses Amazon MSK, Apache Spark, Amazon EMR, Athena, and AWS Glue
- Optimized EMR streaming ingestion using instance fleets and Kafka Spark properties
- Addressed small files problem via trigger tuning and Apache Iceberg compaction
- Implemented data retention using copy-on-write mode with snapshot expiration
- Monitored infrastructure via CloudWatch, Prometheus, and custom Spark metrics
- Achieved a 50% reduction in data pipeline and query costs
Orca's data lake demonstrates how Apache Iceberg enables scalable, cost-effective transactional analytics at petabyte scale with significant operational improvements.
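The small-files and retention points above can be illustrated with a minimal sketch. It is not Orca's actual code. It assumes a Spark session with an Iceberg catalog registered under the hypothetical name `glue_catalog` and a hypothetical table `db.events`. The first helper is back-of-the-envelope arithmetic showing why a longer streaming trigger interval produces fewer files. The other two build calls to Iceberg's `rewrite_data_files` (compaction) and `expire_snapshots` Spark procedures. In copy-on-write mode, deleting rows rewrites the affected data files, and the old files are only physically removed once the snapshots that reference them are expired.

```python
from datetime import datetime


def files_per_day(trigger_seconds: int, partitions_per_batch: int) -> int:
    """Rough upper bound on data files a streaming job writes per day.

    Assumption: every micro-batch writes one file per partition it touches.
    """
    batches_per_day = 86_400 // trigger_seconds
    return batches_per_day * partitions_per_batch


def compaction_sql(catalog: str, table: str, target_file_bytes: int) -> str:
    """Build a call to Iceberg's rewrite_data_files Spark procedure."""
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('target-file-size-bytes', '{target_file_bytes}'))"
    )


def expire_snapshots_sql(
    catalog: str, table: str, older_than: datetime, retain_last: int
) -> str:
    """Build a call to Iceberg's expire_snapshots Spark procedure."""
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', "
        f"retain_last => {retain_last})"
    )


if __name__ == "__main__":
    # Hypothetical numbers: 100 partitions touched per micro-batch.
    print(files_per_day(60, 100))   # 1-minute trigger
    print(files_per_day(300, 100))  # 5-minute trigger
    # With a live session, each statement would be passed to spark.sql(...).
    print(compaction_sql("glue_catalog", "db.events", 512 * 1024 * 1024))
    print(expire_snapshots_sql("glue_catalog", "db.events", datetime(2024, 1, 1), 5))
```

Under these assumptions, moving the trigger from one minute to five cuts the file count by a factor of five before any compaction runs. Compaction then merges what remains toward the target file size. The catalog name, table name, target size, and retention values are placeholders to adapt to a real deployment.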