Data-driven decision making is accelerating and defining the way organizations work. With this transformation, there has been a rapid adoption of data lakes across the industry.
To fuel this transformation, data lakes have evolved over the last decade, with Apache Hive emerging as the de facto standard for data lakes. However, while Apache Hive solves some data-processing problems, it falls short of several objectives for next-generation data processing.
In this blog, we will discuss the drawbacks of the existing data lake architecture built on Apache Hive, introduce Apache Iceberg, and show how it overcomes the shortcomings of the current state of data lakes. Additionally, we will review the design differences between Apache Hive and Apache Iceberg.
Apache Iceberg is an open table format, originally designed at Netflix to overcome the challenges of existing data lake formats like Apache Hive.
The key problems Iceberg tries to address are discussed in the sections that follow.
Iceberg was open sourced in 2018 as an Apache Incubator project and graduated from the incubator on May 19, 2020.
Apache Iceberg is a new table format design that addresses the issues Apache Hive faces when used at scale with large datasets. Key challenges with Apache Hive include schema changes that can require rewriting the complete table, partition values that users must manage explicitly in their queries, and the lack of row-level updates and snapshot-based time travel.
Apache Iceberg, on the other hand, is a new open table format designed to overcome these drawbacks. The key difference lies in how Apache Iceberg stores and tracks records in object storage.
Expressive SQL: Iceberg supports flexible SQL commands to merge new data, update existing rows, and perform targeted deletes on tables. Its metadata architecture under the hood makes these row-level operations efficient enough to run analytical queries directly on data lakes.
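As an illustrative sketch (the catalog, table, and column names here are hypothetical), a single `MERGE INTO` statement in Spark SQL can upsert, update, and delete rows of an Iceberg table in one pass:

```sql
-- Hypothetical tables: merge a staging feed into an Iceberg target table.
MERGE INTO my_catalog.db.customers AS t
USING my_catalog.staging.customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.is_deleted = true THEN DELETE        -- targeted delete
WHEN MATCHED THEN UPDATE SET t.email = s.email          -- update existing row
WHEN NOT MATCHED THEN INSERT *;                         -- merge new data
```

Running this requires a Spark session with an Iceberg catalog configured, as shown in the AWS integration examples later in this post.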
Schema Evolution: Adding, renaming, and reordering columns works well, and schema changes never require rewriting the complete table, because columns are uniquely identified in the metadata layer by IDs rather than by the column name itself.
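To make the ID-based design concrete, here is a toy model (an assumption-laden sketch, not Iceberg's actual implementation): data files reference columns by immutable field IDs, so a rename rewrites only the schema metadata, never the data files.

```python
# Toy model of ID-based schema tracking: columns get stable, unique field IDs
# at creation time, and renames only touch the ID-to-name mapping.

class ToySchema:
    def __init__(self, columns):
        # Assign a stable, unique ID to each column at creation time.
        self.id_to_name = {i: name for i, name in enumerate(columns, start=1)}

    def rename(self, old, new):
        # A rename rewrites only this mapping; field IDs are untouched.
        for fid, name in self.id_to_name.items():
            if name == old:
                self.id_to_name[fid] = new
                return
        raise KeyError(old)

    def resolve(self, fid):
        # Data files store values keyed by field ID, so reads stay correct
        # across renames without rewriting any data.
        return self.id_to_name[fid]

schema = ToySchema(["event_ts", "user", "amount"])
schema.rename("user", "user_id")
print(schema.resolve(2))  # the same field ID now resolves to "user_id"
```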
Hidden Partitioning: Partitioning in Iceberg is dynamic. For example, if a table has an event-time (timestamp) column, the table can be partitioned by the date derived from that column; Apache Iceberg manages the relationship between the event timestamp column and its date partition. Additional levels of partitioning can be added over time, and these are tracked per snapshot via metadata files.
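The following toy sketch mirrors the idea of a hidden `day()` partition transform conceptually (it is not Iceberg code): rows carry only the raw timestamp, and the table derives the partition value itself, so users never supply or query partition columns directly.

```python
# Toy sketch of a hidden partition transform: the partition value is
# computed from the event timestamp, never supplied by the user.
from datetime import datetime, timezone

def day_transform(ts: datetime) -> str:
    # Derive the date partition value from the event timestamp.
    return ts.strftime("%Y-%m-%d")

# Rows carry only the timestamp; partitions are derived, not user-managed.
rows = [
    {"event_ts": datetime(2022, 3, 1, 14, 30, tzinfo=timezone.utc), "value": 10},
    {"event_ts": datetime(2022, 3, 2, 9, 15, tzinfo=timezone.utc), "value": 20},
]
partitions = {day_transform(r["event_ts"]) for r in rows}
print(sorted(partitions))  # ['2022-03-01', '2022-03-02']
```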
Time Travel and Rollback: Apache Iceberg supports two snapshot read options, which enable time travel and incremental reads. These are the options supported:
• snapshot-id – selects a specific table snapshot
• as-of-timestamp – selects the snapshot that was current at a given timestamp, in milliseconds
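The semantics of the two options can be sketched with a toy snapshot log (a simplified model, not Iceberg's implementation): every commit produces a snapshot with an ID and a commit timestamp, and readers select either an exact snapshot ID or the snapshot current as of a point in time.

```python
# Toy snapshot log: (snapshot_id, commit_timestamp_millis, table_rows).
snapshots = [
    (1001, 1_646_000_000_000, ["row-a"]),
    (1002, 1_646_100_000_000, ["row-a", "row-b"]),
    (1003, 1_646_200_000_000, ["row-a", "row-b", "row-c"]),
]

def read_snapshot(snapshot_id):
    # "snapshot-id": select one specific table snapshot.
    for sid, _, rows in snapshots:
        if sid == snapshot_id:
            return rows
    raise KeyError(snapshot_id)

def read_as_of(ts_millis):
    # "as-of-timestamp": the latest snapshot committed at or before ts.
    current = None
    for _, committed, rows in snapshots:
        if committed <= ts_millis:
            current = rows
    return current

print(read_snapshot(1002))            # ['row-a', 'row-b']
print(read_as_of(1_646_150_000_000))  # ['row-a', 'row-b']
```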
Apache Iceberg has multiple AWS service integrations, spanning query engines, catalogs, and the infrastructure to run them.
AWS supports integrations with the following engines and allows setting up custom catalogs.
There are multiple options that users can choose from to build an Iceberg catalog on AWS.
Apache Iceberg supports integration with the AWS Glue Catalog: an Apache Iceberg namespace is stored as a Glue database, an Apache Iceberg table is stored as a Glue table, and every Apache Iceberg table version is stored as a Glue table version. The following is an example of configuring Spark SQL with the Glue catalog.
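A minimal sketch of such a configuration, assuming a recent Iceberg Spark runtime; the catalog name (`my_catalog`), bucket, and version numbers are placeholders to adapt to your environment:

```shell
# Launch Spark SQL with a Glue-backed Iceberg catalog (placeholder names).
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1,software.amazon.awssdk:bundle:2.17.131 \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
  --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/warehouse \
  --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.S3FileIO
```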
For commit locking, the Glue catalog uses DynamoDB to coordinate concurrent commits; for file IO and storage, it uses S3.
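As a sketch (the lock table name is a placeholder, and the lock manager class should be checked against your Iceberg version), the DynamoDB-based lock can be enabled with two additional catalog properties:

```shell
# Optional DynamoDB lock manager for Glue catalog commits (placeholder names).
  --conf spark.sql.catalog.my_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager \
  --conf spark.sql.catalog.my_catalog.lock.table=myGlueLockTable
```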
The DynamoDB catalog capability is still in the preview stage. The DynamoDB catalog avoids hot-partition issues during heavy write traffic to the tables and, through optimistic locking, provides the best performance when high read and write throughput is required.
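A minimal sketch of a DynamoDB-backed catalog configuration (catalog name, DynamoDB table name, and bucket are placeholders):

```shell
# Spark SQL with a DynamoDB-backed Iceberg catalog (placeholder names).
spark-sql \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.dynamodb.DynamoDbCatalog \
  --conf spark.sql.catalog.my_catalog.dynamodb.table-name=my_iceberg_catalog \
  --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/warehouse
```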
The JDBC catalog uses a table in a relational database to manage Apache Iceberg tables, and Amazon RDS can serve as that database. The RDS-backed catalog is recommended when the organization already has an existing managed relational database, since this provides easy integration. Here is an example of configuring Iceberg with RDS as the catalog for the Spark engine.
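A sketch assuming a PostgreSQL RDS instance; the endpoint, database, credentials, and bucket are all placeholders (in practice, credentials should come from a secrets store rather than the command line):

```shell
# Spark SQL with a JDBC catalog backed by RDS PostgreSQL (placeholder values).
spark-sql \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
  --conf spark.sql.catalog.my_catalog.uri=jdbc:postgresql://my-rds-endpoint:5432/mydb \
  --conf spark.sql.catalog.my_catalog.jdbc.user=iceberg_user \
  --conf spark.sql.catalog.my_catalog.jdbc.password=iceberg_password \
  --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/warehouse
```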
Athena provides integration with Iceberg; currently it is in preview. To run Iceberg queries with Athena, create a workgroup named “AmazonAthenaIcebergPreview” and run the Iceberg-related queries using this workgroup. Currently, the Athena engine supports reads, writes, and updates to Iceberg tables.
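As a sketch of what queries in that workgroup look like (database, table, columns, and S3 location are hypothetical):

```sql
-- Create an Iceberg table in Athena (preview) and update a row in place.
CREATE TABLE my_db.my_events (
  event_ts timestamp,
  user_id string,
  amount double
)
LOCATION 's3://my-bucket/my_db/my_events/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

UPDATE my_db.my_events SET amount = 0 WHERE user_id = 'test';
```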
AWS EMR 6.5.0 and later has Apache Iceberg dependencies pre-installed, without requiring any additional bootstrap actions. For versions before 6.5.0, these dependencies need to be added through bootstrap actions in order to use Iceberg tables. EMR provides Spark, Hive, Flink, and Trino runtimes that can run Iceberg.
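On EMR 6.5.0 and later, Iceberg still has to be switched on via a configuration classification when creating the cluster; a minimal sketch of that configuration:

```json
[
  {
    "Classification": "iceberg-defaults",
    "Properties": {
      "iceberg.enabled": "true"
    }
  }
]
```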
AWS EKS can run Spark, Hive, Flink, Trino, and Presto clusters, all of which can be integrated with Iceberg. Similarly, Amazon Kinesis can be integrated with Flink, which has a connector for Apache Iceberg.
Apache Iceberg has integrations with various query and execution engines, and Apache Iceberg tables can be created and managed through these connectors. The engines that support Iceberg are Spark, Flink, Hive, Presto, Trino, Dremio, and Snowflake.
As organizations move towards data-driven decision making, the importance of lakehouse-style architectures is increasing rapidly. Apache Iceberg, a new open table format that can scale and evolve seamlessly, provides key benefits over its predecessor Apache Hive.
Apache Iceberg is best suited for batch and micro-batch processing of datasets. The growing open source community and integrations from multiple cloud providers make it easier to integrate Apache Iceberg into existing architectures effectively.
Use the Feedback tab to make any comments or ask questions. You can also start a conversation with us.