Orca Security built a petabyte-scale transactional data lake using Apache Iceberg, Amazon S3, and AWS Analytics services, cutting costs by more than 50% in both data pipeline and query operations while enabling advanced anomaly detection and cloud security capabilities.


<div><p>This article describes Orca Security's implementation of a petabyte-scale transactional data lake using Apache Iceberg, Amazon S3, and AWS Analytics services.</p><ul><li>Orca migrated from data silos to a centralized data lake for scalability and advanced analytics</li><li>Apache Iceberg was chosen for its ACID guarantees, schema evolution, and engine-agnostic design</li><li>The architecture uses Amazon MSK, Apache Spark, Amazon EMR, Amazon Athena, and AWS Glue</li><li>EMR streaming ingestion was optimized using instance fleets and Kafka Spark properties</li><li>The small-files problem was addressed through trigger tuning and Apache Iceberg compaction</li><li>Data retention was implemented using copy-on-write mode with snapshot expiration</li><li>Infrastructure is monitored via CloudWatch, Prometheus, and custom Spark metrics</li><li>Costs fell by more than 50% in both data pipeline and query operations</li></ul><p>Orca's data lake demonstrates how Apache Iceberg enables scalable, cost-effective transactional analytics at petabyte scale.</p></div>
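The compaction and snapshot-expiration steps listed above could be sketched roughly as follows. This is a minimal illustration, not Orca's actual code: it only builds the SQL text for Iceberg's Spark procedures `rewrite_data_files` and `expire_snapshots`. The catalog name, table name, target file size, and retention window are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compaction_sql(catalog: str, table: str,
                   target_file_bytes: int = 512 * 1024 * 1024) -> str:
    """Build a CALL statement for Iceberg's rewrite_data_files procedure.

    The 512 MiB default target size is an assumption for illustration.
    """
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('target-file-size-bytes', '{target_file_bytes}'))"
    )


def expire_snapshots_sql(catalog: str, table: str, retention_days: int,
                         retain_last: int = 1,
                         now: Optional[datetime] = None) -> str:
    """Build a CALL statement for Iceberg's expire_snapshots procedure.

    Snapshots older than `retention_days` are expired, while the most
    recent `retain_last` snapshots are always kept.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    cutoff_text = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff_text}', "
        f"retain_last => {retain_last})"
    )


if __name__ == "__main__":
    # Hypothetical catalog and table names, used only for this example.
    print(compaction_sql("glue_catalog", "security.events"))
    print(expire_snapshots_sql("glue_catalog", "security.events", retention_days=30))
```

On a Spark session configured with an Iceberg catalog and the Iceberg SQL extensions, each string could be passed to `spark.sql(...)` from a scheduled maintenance job. Running compaction before expiration is one reasonable ordering, since the files replaced by compaction only become removable once the snapshots that reference them are expired.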

