Cloud Blog: BigQuery tables for Apache Iceberg: optimized storage for the open lakehouse

Source URL: https://cloud.google.com/blog/products/data-analytics/announcing-bigquery-tables-for-apache-iceberg/
Source: Cloud Blog
Title: BigQuery tables for Apache Iceberg: optimized storage for the open lakehouse

Feedly Summary: For several years, BigQuery native tables have supported enterprise-level data management capabilities such as ACID transactions, streaming ingestion, and automatic storage optimizations. Many BigQuery customers store data in data lakes using open-source file formats such as Apache Parquet and table formats such as Apache Iceberg. In 2022, we launched BigLake tables to allow customers to maintain a single copy of data and benefit from the security and performance offered by BigQuery. However, BigLake tables are currently read-only; BigQuery customers have to perform data mutations through external query engines and manually orchestrate data management. Another challenge is the “small files problem” during ingestion: because cloud object stores do not support appends, table writes need to be micro-batched, requiring trade-offs between performance and data consistency.
Today, we’re excited to announce the preview of BigQuery tables for Apache Iceberg, a fully managed, Apache Iceberg-compatible storage engine from BigQuery with features such as autonomous storage optimizations, clustering, and high-throughput streaming ingestion. BigQuery tables for Apache Iceberg use the Apache Iceberg format to store data in customer-owned cloud storage buckets while providing a similar customer experience and feature set as BigQuery native tables. Through BigQuery tables for Apache Iceberg, we are bringing a decade of BigQuery innovations to the lakehouse.

BigQuery tables for Apache Iceberg are writable from BigQuery through GoogleSQL data manipulation language (DML) and support high-throughput streaming ingestion from open-source engines such as Apache Spark through BigQuery’s Write API. Here is an example that creates a BigLake managed table with clustering:

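The statement below is a minimal sketch of that DDL; the dataset, connection, bucket, and column names are placeholders, and the WITH CONNECTION and OPTIONS clauses follow the preview syntax for storing the table's data files in a customer-owned Cloud Storage bucket.

```sql
-- Create a clustered BigLake managed table in Apache Iceberg format whose data
-- files live in a customer-owned Cloud Storage bucket. All names are placeholders.
CREATE TABLE mydataset.orders (
  order_id INT64,
  customer_id INT64,
  order_status STRING,
  order_ts TIMESTAMP
)
CLUSTER BY customer_id
WITH CONNECTION `myproject.us.my-connection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://mybucket/orders'
);
```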

Fully managed enterprise storage for the lakehouse
BigQuery tables for Apache Iceberg address the limitations of open-source table formats. With BigQuery tables for Apache Iceberg, BigQuery takes care of table-maintenance tasks autonomously, without customer toil. BigQuery keeps the table optimized by compacting smaller files into optimally sized ones, automatically re-clustering data, and garbage-collecting unreferenced files. For instance, optimal file sizes are adaptively determined based on the size of the table. BigQuery tables for Apache Iceberg benefit from over a decade of expertise running automated storage optimization for BigQuery native tables efficiently and cost-effectively. There is no need to run OPTIMIZE or VACUUM manually.
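As a concrete illustration, a routine mutation is just GoogleSQL DML with no maintenance step afterwards; the table and column names below are hypothetical.

```sql
-- Hypothetical table and columns; a standard GoogleSQL DML statement run
-- directly against the Iceberg table.
UPDATE mydataset.orders
SET order_status = 'SHIPPED'
WHERE order_status = 'PENDING'
  AND order_ts < TIMESTAMP '2024-01-01';
-- No OPTIMIZE or VACUUM follows: BigQuery compacts and re-clusters the
-- underlying Parquet files autonomously.
```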
For high-throughput streaming ingestion, BigQuery tables for Apache Iceberg leverage Vortex, an exabyte-scale structured storage system that powers the BigQuery Storage Write API. BigQuery tables for Apache Iceberg durably store recently ingested tuples in a row-oriented format and periodically convert them to Parquet. High-throughput ingestion and parallel reads are also supported through the open-source Spark and Flink BigQuery connectors, and Pub/Sub and Datastream can ingest data into BigQuery tables for Apache Iceberg directly, so you don’t need to maintain bespoke infrastructure.
BigQuery tables for Apache Iceberg store table metadata in BigQuery’s scalable metadata management system. BigQuery stores fine-grained metadata and uses distributed query processing and data management techniques to handle it. As a result, BigQuery tables for Apache Iceberg aren’t constrained by the need to commit metadata to object stores, allowing a higher rate of mutations than is possible with open-source table formats. And because writers cannot directly mutate the transaction log, the table metadata is tamper-proof and carries a reliable audit history.
BigQuery tables for Apache Iceberg continue to support fine-grained security policies enforced by the storage APIs while extending support for governance policy management, data quality and end-to-end lineage through Dataplex.

BigQuery tables for Apache Iceberg export metadata into Iceberg snapshots in Cloud Storage. The pointer to the latest exported metadata will soon be registered in BigQuery metastore, a serverless runtime metadata service announced earlier this year. Iceberg metadata exports allow any engine capable of understanding Iceberg to query the data directly from Cloud Storage.
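A minimal sketch of triggering such an export on demand, assuming the GoogleSQL EXPORT TABLE METADATA statement available for these tables; the table name is a placeholder.

```sql
-- Write the latest Iceberg metadata snapshot for the table into its Cloud Storage
-- location, so any Iceberg-aware engine can read the data files directly.
EXPORT TABLE METADATA FROM mydataset.orders;
```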
Learn more
Customers like HCA Healthcare, one of the largest health care providers in the world, see value in BigQuery tables for Apache Iceberg as their Apache Iceberg-compatible storage layer in BigQuery, making new lakehouse use cases possible. The preview of BigQuery tables for Apache Iceberg is available in all Google Cloud regions. You can get started today by following the documentation.

AI Summary and Description: Yes

**Summary:** The text details the launch of BigQuery tables for Apache Iceberg, which enhance data management capabilities for professionals leveraging cloud-based data lakes. It highlights advancements in performance, security, and usability, particularly for high-throughput streaming ingestion and autonomous table maintenance, thereby addressing significant challenges in data management for users of open-source data formats.

**Detailed Description:**
The announcement of BigQuery tables for Apache Iceberg represents a pivotal development for organizations utilizing data lakes and looking for streamlined data management in a cloud environment. Here are the major points outlined in the text:

– **BigQuery’s Existing Capabilities:** Historically, BigQuery has supported ACID transactions, streaming ingestion, and automatic storage optimizations, appealing to enterprise customers dealing with large datasets.

– **BigLake Tables Introduction:** BigLake tables were previously introduced to centralize data management, offering security and performance benefits, but were limited to read-only operations; data mutations had to go through external query engines, adding complexity.

– **Challenges Noted:**
– **Small Files Problem:** Ingestion processes faced difficulties due to the need for micro-batching in cloud object stores, creating a trade-off between performance and consistency.
– **Manual Maintenance:** Prior operations required manual actions such as optimizing files and maintaining tables.

– **Key Features of BigQuery Tables for Apache Iceberg:**
– **Writable Tables:** Unlike BigLake tables, these are writable and allow data mutations through GoogleSQL DML, providing a more seamless user experience akin to using native BigQuery tables.
– **Enhanced Streaming Ingestion:** The tables support high-throughput streaming ingestion from open-source engines like Apache Spark, with the underlying architecture optimizing data storage and management.
– **Automated Maintenance:** The service autonomously manages table maintenance tasks such as file optimization, clustering, and garbage collection without requiring customer intervention.

– **Security and Governance:** BigQuery tables for Apache Iceberg have fine-grained security measures facilitated by storage APIs, alongside capabilities for governance policy management, data quality, and end-to-end lineage via Dataplex.

– **Integration & Accessibility:** The storage metadata can be exported as Iceberg snapshots in cloud storage, enabling cross-compatibility with various engines while maintaining integrity and audit trails.

– **Practical Implications for Users:** The introduction of BigQuery tables for Apache Iceberg promises to streamline data management tasks, enhance data processing performance, and improve governance and security compliance, making it a valuable tool for organizations like HCA Healthcare in their data strategy.

This launch is significant for cloud security professionals and data architects seeking robust solutions for managing extensive datasets while ensuring compliance and security protocols are maintained efficiently.