Using lakeFS with Apache Iceberg¶
lakeFS Iceberg REST Catalog¶
Info
Available in lakeFS Enterprise
Tip
lakeFS Iceberg REST Catalog is currently in private preview for lakeFS Enterprise customers. Contact us to get started!
What is lakeFS Iceberg REST Catalog?¶
lakeFS Iceberg REST Catalog lets you use lakeFS as a spec-compliant Apache Iceberg REST catalog, allowing Iceberg clients to manage and access tables using a standard REST API.
Using lakeFS Iceberg REST Catalog, you can use lakeFS as a drop-in replacement for other Iceberg catalogs such as AWS Glue, Nessie, Hive Metastore - or the lakeFS HadoopCatalog (see below).
With lakeFS Iceberg REST Catalog, you can:
- Manage Iceberg tables with full version control capabilities.
- Use standard Iceberg clients and tools without modification.
- Leverage lakeFS's branching and merging features for managing table lifecycles.
- Maintain data consistency across different environments.
Use Cases¶
- Version-Controlled Data Development:
- Create feature branches for table schema changes or data migrations
- Test modifications in isolation, across multiple tables
- Merge changes safely with conflict detection
- Multi-Environment Management:
- Use branches to represent different environments (dev, staging, prod)
- Promote changes between environments through merges, with automated testing
- Maintain consistent table schemas across environments
- Collaborative Data Development:
- Multiple teams can work on different table features simultaneously
- Maintain data quality through pre-merge validations
- Collaborate using pull requests on changes to data and schema
- Manage and Govern Access to data:
- Use the detailed built-in commit log that captures who changed the data, what changed, and how
- Manage access with fine-grained access control for users and groups using RBAC policies
- Rollback changes atomically and safely to reduce time-to-recover and increase system stability
Configuration¶
The Iceberg REST catalog API is exposed at /iceberg/api in your lakeFS server.
To use it:
- Enable the feature (contact us for details).
- Configure your Iceberg clients to use the lakeFS REST catalog endpoint.
- Use your lakeFS access key and secret for authentication.
Catalog Initialization Example (using pyiceberg)¶
from pyiceberg.catalog.rest import RestCatalog

# Replace these placeholders with your lakeFS endpoint and credentials
lakefs_endpoint = "https://lakefs.example.com"
lakefs_client_key = "AKIAlakefs12345EXAMPLE"
lakefs_client_secret = "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY"

catalog = RestCatalog(name="my_catalog", **{
    'prefix': 'lakefs',
    'uri': f'{lakefs_endpoint}/iceberg/api',
    'oauth2-server-uri': f'{lakefs_endpoint}/iceberg/api/v1/oauth/tokens',
    'credential': f'{lakefs_client_key}:{lakefs_client_secret}',
})
Example Client code¶
import lakefs
from pyiceberg.catalog.rest import RestCatalog

# Initialize the catalog
catalog = RestCatalog(name="my_catalog", **{
    'prefix': 'lakefs',
    'uri': 'https://lakefs.example.com/iceberg/api',
    'oauth2-server-uri': 'https://lakefs.example.com/iceberg/api/v1/oauth/tokens',
    'credential': 'AKIAlakefs12345EXAMPLE:abc/lakefs/1234567bPxRfiCYEXAMPLEKEY',
})
# List namespaces in a branch
catalog.list_namespaces(('repo', 'main'))
# Query a table
catalog.list_tables('repo.main.inventory')
table = catalog.load_table('repo.main.inventory.books')
arrow_df = table.scan().to_arrow()
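The Spark usage below assumes Spark has already been pointed at the same REST catalog. A minimal PySpark configuration sketch, where the Spark catalog name lakefs, the endpoint and the credentials are illustrative and the Iceberg Spark runtime package is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # register an Iceberg REST catalog named "lakefs" (name is illustrative)
    .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakefs.type", "rest")
    .config("spark.sql.catalog.lakefs.uri", "https://lakefs.example.com/iceberg/api")
    .config("spark.sql.catalog.lakefs.prefix", "lakefs")
    .config("spark.sql.catalog.lakefs.oauth2-server-uri", "https://lakefs.example.com/iceberg/api/v1/oauth/tokens")
    .config("spark.sql.catalog.lakefs.credential", "AKIAlakefs12345EXAMPLE:abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
    # make it the default catalog so the USE statements below resolve against it
    .config("spark.sql.defaultCatalog", "lakefs")
    .getOrCreate()
)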
# With Spark configured to use the lakeFS REST catalog, select a repository, branch and namespace
spark.sql("USE my_repo.main.inventory")
# List available tables
spark.sql("SHOW TABLES").show()
# Query data with branch isolation
spark.sql("SELECT * FROM books").show()
# Switch to a feature branch
spark.sql("USE my_repo.new_branch.inventory")
spark.sql("SELECT * FROM books").show()
Namespaces and Tables¶
Namespace Operations¶
The Iceberg Catalog supports Iceberg namespace operations:
- Create namespaces
- List namespaces
- Drop namespaces
- List tables within namespaces
Namespace Usage¶
Namespaces in the Iceberg Catalog follow the pattern "<repository>.<branch>.<namespace>(.<namespace>...)", where:
- <repository> must be a valid lakeFS repository name.
- <branch> must be a valid lakeFS branch name.
- <namespace> components can be nested using a unit separator (e.g., inventory.books).
Examples:
my-repo.main.inventory
my-repo.feature-branch.inventory.books
The repository and branch components must already exist in lakeFS before using them in the Iceberg catalog.
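For example, a short sketch (repository name, storage namespace and namespace names are illustrative; catalog is the RestCatalog object from the examples above) that creates the repository in lakeFS first and then creates a namespace through the catalog:

import lakefs

# the repository (and its default branch, main) must exist in lakeFS first
lakefs.repository('repo').create(storage_namespace='s3://example-bucket/repo')

# namespaces can then be created through the Iceberg REST catalog
catalog.create_namespace(('repo', 'main', 'inventory'))
print(catalog.list_namespaces(('repo', 'main')))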
Namespace Restrictions¶
- Repository and branch names must follow lakeFS naming conventions.
- Namespace components cannot contain special characters except dots (.) for nesting.
- The total namespace path length must be less than 255 characters.
- Namespaces are case-sensitive.
- Empty namespace components are not allowed.
Table Operations¶
The Iceberg Catalog supports all standard Iceberg table operations:
- Create tables with schemas and partitioning.
- Update table schemas and partitioning.
- Commit changes to tables.
- Delete tables.
- List tables in namespaces.
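As an illustrative sketch (the schema, namespace and table names are assumptions; catalog is the RestCatalog from the examples above), creating and evolving a table works as with any other Iceberg REST catalog:

from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, StringType

# define a simple schema and create the table on the main branch
schema = Schema(
    NestedField(1, "id", LongType(), required=True),
    NestedField(2, "title", StringType(), required=False),
)
table = catalog.create_table('repo.main.inventory.books', schema=schema)

# evolve the schema through a regular Iceberg schema update
with table.update_schema() as update:
    update.add_column("author", StringType())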
Version Control Features¶
The Iceberg Catalog integrates with lakeFS's version control system, treating each table change as a commit. This provides a complete history of table modifications and enables branching and merging workflows.
Catalog Changes as Commits¶
Each modification to a table (schema changes, data updates, etc.) creates a new commit in lakeFS. Creating or deleting a namespace or a table results in a lakeFS commit on the relevant branch, as well as table data updates ("Iceberg table commit").
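For example, a sketch using the lakefs Python SDK (repository and branch names follow the examples above) to inspect the commits that table changes produce:

import lakefs

# each table change made through the catalog shows up as a commit on the branch
branch = lakefs.repository('repo').branch('main')
for commit in branch.log(max_amount=5):
    print(commit.id, commit.message)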
Branching and Merging¶
Create a new branch to work on table changes:
# Create a lakeFS branch using lakeFS Python SDK
branch = lakefs.repository('repo').branch('new_branch').create(source_reference='main')
# The table is now accessible in the new branch
new_table = catalog.load_table(f'repo.{branch.id}.inventory.books')
Merge changes between branches:
# Merge the branch using lakeFS Python SDK
branch.merge_into('main')
# Changes are now visible in main
main_table = catalog.load_table('repo.main.inventory.books')
Info
Currently, lakeFS handles table changes as file operations during merges.
This means that when merging branches with table changes, lakeFS treats the table metadata files as regular files.
No special merge logic is applied to handle conflicting table changes, and if there are conflicting changes to the same table in different branches, the merge will fail with a conflict that needs to be resolved manually.
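A sketch of detecting such a conflict from the lakefs Python SDK, assuming merge conflicts surface as lakefs.exceptions.ConflictException:

import lakefs
from lakefs.exceptions import ConflictException

branch = lakefs.repository('repo').branch('new_branch')
try:
    branch.merge_into('main')
except ConflictException:
    # conflicting changes to the same table must be resolved manually,
    # e.g. by reverting one side or re-applying the change on top of main
    print("merge conflict: resolve the table change manually and retry")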
Authentication¶
lakeFS provides an OAuth2 token endpoint at /iceberg/api/v1/oauth/tokens (as used in the examples above) that clients need to configure.
To authenticate, clients must provide their lakeFS access key and secret in the format access_key:secret
as the credential.
The authorization requirements are managed at the lakeFS level, meaning:
- Users need appropriate lakeFS permissions to access repositories and branches
- Table operations require lakeFS permissions on the underlying objects
- The same lakeFS RBAC policies apply to Iceberg catalog operations
Limitations¶
- Table Maintenance:
- See Table Maintenance section for details
- Advanced Features:
- Views (all view operations are unsupported)
- Transactional DML (stage-create)
- Server-side query planning
- Table renaming
- Updating a table's location (using commit)
- Table statistics (set-statistics and remove-statistics operations are currently a no-op)
- lakeFS Iceberg REST Catalog is currently tested with Amazon S3 and Google Cloud Storage. Other storage backends, such as Azure or local storage, are not yet supported but will be in future releases.
- Currently, only the Iceberg v2 table format is supported.
Table Maintenance¶
The following table maintenance operations are not supported in the current version:
- Drop table with purge
- Compact data files
- Rewrite manifests
- Expire snapshots
- Remove old metadata files
- Delete orphan files
Danger
To prevent data loss, clients should disable their own cleanup operations by:
- Disabling orphan file deletion.
- Setting remove-dangling-deletes to false when rewriting.
- Disabling snapshot expiration.
- Setting a very high value for the min-snapshots-to-keep parameter.
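For example, one way to make engine-side snapshot expiration effectively a no-op is to raise Iceberg's history.expire.min-snapshots-to-keep table property; a sketch where the table identifier and value are illustrative, assuming a Spark catalog named lakefs as configured earlier:

# keep (practically) all snapshots so expiration does not remove files lakeFS commits still reference
spark.sql(
    "ALTER TABLE lakefs.repo.main.inventory.books "
    "SET TBLPROPERTIES ('history.expire.min-snapshots-to-keep' = '100000')"
)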
Roadmap¶
The following features are planned for future releases:
- Catalog Sync:
- Support for pushing/pulling tables to/from other catalogs
- Integration with AWS Glue and other Iceberg-compatible catalogs
- Table Import:
- Support for importing existing Iceberg tables from other catalogs
- Bulk import capabilities for large-scale migrations
- Azure Storage Support
- Advanced Features:
- Views API support
- Table transactions
- Advanced versioning capabilities
- Merge non-conflicting table updates
How it works¶
Under the hood, the lakeFS Iceberg REST Catalog keeps track of each table's metadata file. This is typically referred to as the table pointer.
This pointer is stored inside the repository's storage namespace.
When a request is made, the catalog resolves the table's fully qualified name <repository>.<reference>.<namespace>.<table_name>, reads the pointer file from the specified reference, and returns the underlying object store location of the metadata file to the client. When a table is created or updated, lakeFS generates a new metadata file inside the storage namespace and registers that metadata file as the current pointer for the requested branch.
This approach leverages Iceberg's existing metadata and the immutability of its snapshots: a commit in lakeFS captures a metadata file, which in turn captures manifest lists, manifest files and all related data files.
Besides simply avoiding "double booking" where both Iceberg and lakeFS would need to keep track of which files belong to which version, it also greatly improves the scalability and compatibility of the catalog with the existing Iceberg tool ecosystem.
Example: Reading an Iceberg Table¶
Here's a simplified example of what reading from an Iceberg table would look like:
sequenceDiagram
Actor Iceberg Client
participant lakeFS Catalog API
participant lakeFS
participant Object Store
Iceberg Client->>lakeFS Catalog API: get table metadata("repo.branch.table")
lakeFS Catalog API->>lakeFS: get('repo', 'branch', 'table')
lakeFS->>lakeFS Catalog API: physical_address
lakeFS Catalog API->>Iceberg Client: object location ("s3://.../metadata.json")
Iceberg Client->>Object Store: GetObject
Object Store->>Iceberg Client: table data
Example: Writing an Iceberg Table¶
Here's a simplified example of what writing to an Iceberg table would look like:
sequenceDiagram
Actor Iceberg Client
participant lakeFS Catalog API
participant lakeFS
participant Object Store
Iceberg Client->>Object Store: table data
Iceberg Client->>lakeFS Catalog API: commit
lakeFS Catalog API->>lakeFS: create branch("branch-tx-uuid")
lakeFS Catalog API->>lakeFS: put('new table pointer')
lakeFS->>Object Store: PutObject("metadata.json")
lakeFS Catalog API->>lakeFS: merge('branch-tx-uuid', 'branch')
lakeFS Catalog API->>Iceberg Client: done
Related Resources¶
Further Reading
Deprecated: Iceberg HadoopCatalog¶
Warning
HadoopCatalog and other filesystem-based catalogs are currently not recommended by the Apache Iceberg community and come with several limitations around concurrency and tooling.
As such, the HadoopCatalog described in this section is now deprecated and will not receive further updates.
Setup
Use the following Maven dependency to install the lakeFS custom catalog:
Configuration
Set up the Spark SQL catalog:
.config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog") \
.config("spark.sql.catalog.lakefs.warehouse", f"lakefs://{repo_name}") \
.config("spark.sql.catalog.lakefs.cache-enabled", "false")
Configure the S3A Hadoop FileSystem with your lakeFS connection details. Note that these are your lakeFS endpoint and credentials, not your S3 ones.
.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io") \
.config("spark.hadoop.fs.s3a.access.key", "AKIAIO5FODNN7EXAMPLE") \
.config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY") \
.config("spark.hadoop.fs.s3a.path.style.access", "true")
spark-shell --conf spark.sql.catalog.lakefs="org.apache.iceberg.spark.SparkCatalog" \
--conf spark.sql.catalog.lakefs.catalog-impl="io.lakefs.iceberg.LakeFSCatalog" \
--conf spark.sql.catalog.lakefs.warehouse="lakefs://example-repo" \
--conf spark.sql.catalog.lakefs.cache-enabled="false" \
--conf spark.hadoop.fs.s3.impl="org.apache.hadoop.fs.s3a.S3AFileSystem" \
--conf spark.hadoop.fs.s3a.endpoint="https://example-org.us-east-1.lakefscloud.io" \
--conf spark.hadoop.fs.s3a.access.key="AKIAIO5FODNN7EXAMPLE" \
--conf spark.hadoop.fs.s3a.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
--conf spark.hadoop.fs.s3a.path.style.access="true"
Using Iceberg tables with HadoopCatalog
Create a table
To create a table on your main branch, use the following syntax:
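A sketch (the column types are assumptions; the database and table names match the INSERT statements below):

# create the db1 namespace and an Iceberg table on the main branch of the lakefs catalog
spark.sql("CREATE SCHEMA IF NOT EXISTS lakefs.main.db1")
spark.sql("CREATE TABLE lakefs.main.db1.table1 (id INT, data STRING) USING iceberg")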
Insert data into the table
INSERT INTO lakefs.main.db1.table1 VALUES (1, 'data1');
INSERT INTO lakefs.main.db1.table1 VALUES (2, 'data2');
Create a branch
We can now commit the creation of the table to the main branch:
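For example, with the lakefs Python SDK (the commit message is illustrative):

import lakefs

# commit the table creation and inserts on main
lakefs.repository('example-repo').branch('main').commit(
    message='Create db1.table1 with initial data'
)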
Then, create a branch:
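For example, creating a branch named dev with the lakefs Python SDK (the branch name is an assumption carried through the following snippets):

# branch off main to work on the table in isolation
dev = lakefs.repository('example-repo').branch('dev').create(source_reference='main')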
Make changes on the branch
We can now make changes on the branch:
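For example (the inserted values are illustrative), adding a row to the table on the dev branch:

# the branch name replaces "main" in the table identifier
spark.sql("INSERT INTO lakefs.dev.db1.table1 VALUES (3, 'data3')")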
Query the table
If we query the table on the branch, we will see the data we inserted; if we query the table on the main branch, we will not see the new changes:
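A sketch, continuing the assumed dev branch and insert above:

# on the dev branch the new row is visible
spark.sql("SELECT * FROM lakefs.dev.db1.table1").show()
# expected under the inserts above: (1, 'data1'), (2, 'data2'), (3, 'data3')

# on main the branch's change is not visible until it is merged
spark.sql("SELECT * FROM lakefs.main.db1.table1").show()
# expected: (1, 'data1'), (2, 'data2')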
Migrating an existing Iceberg table to the Hadoop Catalog
This is done through an incremental copy from the original table into lakeFS.
1. Create a new lakeFS repository:

lakectl repo create lakefs://example-repo <base storage path>

2. Initiate a Spark session that can interact with the source Iceberg table and the target lakeFS catalog. Here's an example of a Hadoop and S3 session and lakeFS catalog with per-bucket config:

SparkConf conf = new SparkConf();
conf.set("spark.hadoop.fs.s3a.path.style.access", "true");
// set hadoop on S3 config (source tables we want to copy) for spark
conf.set("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog");
conf.set("spark.sql.catalog.hadoop_prod.type", "hadoop");
conf.set("spark.sql.catalog.hadoop_prod.warehouse", "s3a://my-bucket/warehouse/hadoop/");
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions");
conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.access.key", "<AWS_ACCESS_KEY>");
conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.secret.key", "<AWS_SECRET_KEY>");
// set lakeFS config (target catalog and repository)
conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog");
conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog");
conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo");
conf.set("spark.hadoop.fs.s3a.bucket.example-repo.access.key", "<LAKEFS_ACCESS_KEY>");
conf.set("spark.hadoop.fs.s3a.bucket.example-repo.secret.key", "<LAKEFS_SECRET_KEY>");
conf.set("spark.hadoop.fs.s3a.bucket.example-repo.endpoint", "<LAKEFS_ENDPOINT>");

3. Create Schema in lakeFS and copy the data.

Example of copy with spark-sql:
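A hedged sketch of that copy (schema and table names are illustrative), expressed as spark.sql calls against the hadoop_prod source catalog and the lakefs target catalog configured above:

# create the target schema on the main branch of the lakeFS repository
spark.sql("CREATE SCHEMA IF NOT EXISTS lakefs.main.db1")

# copy the source table into lakeFS as a new Iceberg table
spark.sql(
    "CREATE TABLE IF NOT EXISTS lakefs.main.db1.table1 "
    "USING iceberg "
    "AS SELECT * FROM hadoop_prod.db1.table1"
)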