Using lakeFS with Apache Iceberg¶
lakeFS Iceberg REST Catalog¶
Info
Available in lakeFS Enterprise
Tip
lakeFS Iceberg REST Catalog is currently in private preview for lakeFS Enterprise customers. Contact us to get started!
What is lakeFS Iceberg REST Catalog?¶
lakeFS Iceberg REST Catalog lets you use lakeFS as a spec-compliant Apache Iceberg REST catalog, allowing Iceberg clients to manage and access tables using a standard REST API.
Using lakeFS Iceberg REST Catalog, you can use lakeFS as a drop-in replacement for other Iceberg catalogs such as AWS Glue, Nessie, Hive Metastore - or the lakeFS HadoopCatalog (see below).
With lakeFS Iceberg REST Catalog, you can:
- Manage Iceberg tables with full version control capabilities.
- Use standard Iceberg clients and tools without modification.
- Leverage lakeFS's branching and merging features for managing table lifecycles.
- Maintain data consistency across different environments.
Use Cases¶
- Version-Controlled Data Development:
- Create feature branches for table schema changes or data migrations
- Test modifications in isolation, across multiple tables
- Merge changes safely with conflict detection
- Multi-Environment Management:
- Use branches to represent different environments (dev, staging, prod)
- Promote changes between environments through merges, with automated testing
- Maintain consistent table schemas across environments
- Collaborative Data Development:
- Multiple teams can work on different table features simultaneously
- Maintain data quality through pre-merge validations
- Collaborate using pull requests on changes to data and schema
- Manage and Govern Access to data:
- Use the detailed built-in commit log that captures who changed the data, what changed, and how
- Manage access with fine-grained access control for users and groups using RBAC policies
- Rollback changes atomically and safely to reduce time-to-recover and increase system stability
Configuration¶
The Iceberg REST catalog API is exposed at /iceberg/api in your lakeFS server.
To use it:
- Enable the feature (contact us for details).
- Configure your Iceberg clients to use the lakeFS REST catalog endpoint.
- Use your lakeFS access key and secret for authentication.
Catalog Initialization Example (using pyiceberg)¶
from pyiceberg.catalog.rest import RestCatalog

# Replace these placeholders with your lakeFS endpoint and credentials
lakefs_endpoint = "https://lakefs.example.com"
lakefs_client_key = "AKIAlakefs12345EXAMPLE"
lakefs_client_secret = "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY"

catalog = RestCatalog(name="my_catalog", **{
    'prefix': 'lakefs',
    'uri': f'{lakefs_endpoint}/iceberg/api',
    'oauth2-server-uri': f'{lakefs_endpoint}/iceberg/api/v1/oauth/tokens',
    'credential': f'{lakefs_client_key}:{lakefs_client_secret}',
})
Example Client code¶
import lakefs
from pyiceberg.catalog.rest import RestCatalog

# Initialize the catalog
catalog = RestCatalog(name="my_catalog", **{
    'prefix': 'lakefs',
    'uri': 'https://lakefs.example.com/iceberg/api',
    'oauth2-server-uri': 'https://lakefs.example.com/iceberg/api/v1/oauth/tokens',
    'credential': 'AKIAlakefs12345EXAMPLE:abc/lakefs/1234567bPxRfiCYEXAMPLEKEY',
})
# List namespaces in a branch
catalog.list_namespaces(('repo', 'main'))
# Query a table
catalog.list_tables('repo.main.inventory')
table = catalog.load_table('repo.main.inventory.books')
arrow_df = table.scan().to_arrow()
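The Spark usage below assumes Spark has already been pointed at the same REST catalog. A minimal PySpark configuration sketch, where the Spark catalog name lakefs, the endpoint and the credentials are illustrative and the Iceberg Spark runtime package is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # register an Iceberg REST catalog named "lakefs" (name is illustrative)
    .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakefs.type", "rest")
    .config("spark.sql.catalog.lakefs.uri", "https://lakefs.example.com/iceberg/api")
    .config("spark.sql.catalog.lakefs.prefix", "lakefs")
    .config("spark.sql.catalog.lakefs.oauth2-server-uri", "https://lakefs.example.com/iceberg/api/v1/oauth/tokens")
    .config("spark.sql.catalog.lakefs.credential", "AKIAlakefs12345EXAMPLE:abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
    # make it the default catalog so the USE statements below resolve against it
    .config("spark.sql.defaultCatalog", "lakefs")
    .getOrCreate()
)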
# With Spark configured to use the lakeFS REST catalog, select a repository, branch and namespace
spark.sql("USE my_repo.main.inventory")
# List available tables
spark.sql("SHOW TABLES").show()
# Query data with branch isolation
spark.sql("SELECT * FROM books").show()
# Switch to a feature branch
spark.sql("USE my_repo.new_branch.inventory")
spark.sql("SELECT * FROM books").show()
Namespaces and Tables¶
Namespace Operations¶
The Iceberg Catalog supports Iceberg namespace operations:
- Create namespaces
- List namespaces
- Drop namespaces
- List tables within namespaces
Namespace Usage¶
Namespaces in the Iceberg Catalog follow the pattern "<repository>.<branch>.<namespace>(.<namespace>...)", where:
- <repository> must be a valid lakeFS repository name.
- <branch> must be a valid lakeFS branch name.
- <namespace> components can be nested using a unit separator (e.g., inventory.books).
Examples:
my-repo.main.inventory
my-repo.feature-branch.inventory.books
The repository and branch components must already exist in lakeFS before using them in the Iceberg catalog.
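For example, a short sketch (repository name, storage namespace and namespace names are illustrative; catalog is the RestCatalog object from the examples above) that creates the repository in lakeFS first and then creates a namespace through the catalog:

import lakefs

# the repository (and its default branch, main) must exist in lakeFS first
lakefs.repository('repo').create(storage_namespace='s3://example-bucket/repo')

# namespaces can then be created through the Iceberg REST catalog
catalog.create_namespace(('repo', 'main', 'inventory'))
print(catalog.list_namespaces(('repo', 'main')))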
Namespace Restrictions¶
- Repository and branch names must follow lakeFS naming conventions.
- Namespace components cannot contain special characters except dots (.) for nesting.
- The total namespace path length must be less than 255 characters.
- Namespaces are case-sensitive.
- Empty namespace components are not allowed.
Table Operations¶
The Iceberg Catalog supports all standard Iceberg table operations:
- Create tables with schemas and partitioning.
- Update table schemas and partitioning.
- Commit changes to tables.
- Delete tables.
- List tables in namespaces.
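As an illustrative sketch (the schema, namespace and table names are assumptions; catalog is the RestCatalog from the examples above), creating and evolving a table works as with any other Iceberg REST catalog:

from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, StringType

# define a simple schema and create the table on the main branch
schema = Schema(
    NestedField(1, "id", LongType(), required=True),
    NestedField(2, "title", StringType(), required=False),
)
table = catalog.create_table('repo.main.inventory.books', schema=schema)

# evolve the schema through a regular Iceberg schema update
with table.update_schema() as update:
    update.add_column("author", StringType())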
Version Control Features¶
The Iceberg Catalog integrates with lakeFS's version control system, treating each table change as a commit. This provides a complete history of table modifications and enables branching and merging workflows.
Catalog Changes as Commits¶
Each modification to a table (schema changes, data updates, etc.) creates a new commit in lakeFS. Creating or deleting a namespace or a table results in a lakeFS commit on the relevant branch, as well as table data updates ("Iceberg table commit").
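For example, a sketch using the lakefs Python SDK (repository and branch names follow the examples above) to inspect the commits that table changes produce:

import lakefs

# each table change made through the catalog shows up as a commit on the branch
branch = lakefs.repository('repo').branch('main')
for commit in branch.log(max_amount=5):
    print(commit.id, commit.message)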
Branching and Merging¶
Create a new branch to work on table changes:
# Create a lakeFS branch using lakeFS Python SDK
branch = lakefs.repository('repo').branch('new_branch').create(source_reference='main')
# The table is now accessible in the new branch
new_table = catalog.load_table(f'repo.{branch.id}.inventory.books')
Merge changes between branches:
# Merge the branch using lakeFS Python SDK
branch.merge_into('main')
# Changes are now visible in main
main_table = catalog.load_table('repo.main.inventory.books')
Info
Currently, lakeFS handles table changes as file operations during merges.
This means that when merging branches with table changes, lakeFS treats the table metadata files as regular files.
No special merge logic is applied to handle conflicting table changes, and if there are conflicting changes to the same table in different branches, the merge will fail with a conflict that needs to be resolved manually.
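A sketch of detecting such a conflict from the lakefs Python SDK, assuming merge conflicts surface as lakefs.exceptions.ConflictException:

import lakefs
from lakefs.exceptions import ConflictException

branch = lakefs.repository('repo').branch('new_branch')
try:
    branch.merge_into('main')
except ConflictException:
    # conflicting changes to the same table must be resolved manually,
    # e.g. by reverting one side or re-applying the change on top of main
    print("merge conflict: resolve the table change manually and retry")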
Authentication¶
lakeFS provides an OAuth2 token endpoint at /iceberg/api/v1/oauth/tokens (as used in the examples above) that clients need to configure.
To authenticate, clients must provide their lakeFS access key and secret in the format access_key:secret
as the credential.
The authorization requirements are managed at the lakeFS level, meaning:
- Users need appropriate lakeFS permissions to access repositories and branches
- Table operations require lakeFS permissions on the underlying objects
- The same lakeFS RBAC policies apply to Iceberg catalog operations
Limitations¶
- Table Maintenance:
- See Table Maintenance section for details
- Advanced Features:
- Views (all view operations are unsupported)
- Transactional DML (stage-create)
- Server-side query planning
- Table renaming
- Updating a table's location (using commit)
- Table statistics (set-statistics and remove-statistics operations are currently a no-op)
- lakeFS Iceberg REST Catalog is currently tested with Amazon S3 and Google Cloud Storage. Other storage backends, such as Azure or local storage, are not yet supported but will be in future releases.
- Currently, only the Iceberg v2 table format is supported.
Table Maintenance¶
The following table maintenance operations are not supported in the current version:
- Drop table with purge
- Compact data files
- Rewrite manifests
- Expire snapshots
- Remove old metadata files
- Delete orphan files
Danger
To prevent data loss, clients should disable their own cleanup operations by:
- Disabling orphan file deletion.
- Setting remove-dangling-deletes to false when rewriting.
- Disabling snapshot expiration.
- Setting a very high value for the min-snapshots-to-keep parameter.
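For example, one way to make engine-side snapshot expiration effectively a no-op is to raise Iceberg's history.expire.min-snapshots-to-keep table property; a sketch where the table identifier and value are illustrative, assuming a Spark catalog named lakefs as configured earlier:

# keep (practically) all snapshots so expiration does not remove files lakeFS commits still reference
spark.sql(
    "ALTER TABLE lakefs.repo.main.inventory.books "
    "SET TBLPROPERTIES ('history.expire.min-snapshots-to-keep' = '100000')"
)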
Roadmap¶
The following features are planned for future releases:
- Catalog Sync:
- Support for pushing/pulling tables to/from other catalogs
- Integration with AWS Glue and other Iceberg-compatible catalogs
- Table Import:
- Support for importing existing Iceberg tables from other catalogs
- Bulk import capabilities for large-scale migrations
- Azure Storage Support
- Advanced Features:
- Views API support
- Table transactions
- Advanced versioning capabilities
- Merge non-conflicting table updates
How it works¶
Under the hood, the lakeFS Iceberg REST Catalog keeps track of each table's metadata file. This is typically referred to as the table pointer.
This pointer is stored inside the repository's storage namespace.
When a request is made, the catalog resolves the table's fully qualified name <repository>.<reference>.<namespace>.<table_name>, reads the pointer file from the specified reference, and returns the underlying object store location of the metadata file to the client. When a table is created or updated, lakeFS generates a new metadata file inside the storage namespace and registers that metadata file as the current pointer for the requested branch.
This approach leverages Iceberg's existing metadata and the immutability of its snapshots: a commit in lakeFS captures a metadata file, which in turn captures manifest lists, manifest files and all related data files.
Besides simply avoiding "double booking" where both Iceberg and lakeFS would need to keep track of which files belong to which version, it also greatly improves the scalability and compatibility of the catalog with the existing Iceberg tool ecosystem.
Example: Reading an Iceberg Table¶
Here's a simplified example of what reading from an Iceberg table would look like:
sequenceDiagram
Actor Iceberg Client
participant lakeFS Catalog API
participant lakeFS
participant Object Store
Iceberg Client->>lakeFS Catalog API: get table metadata("repo.branch.table")
lakeFS Catalog API->>lakeFS: get('repo', 'branch', 'table')
lakeFS->>lakeFS Catalog API: physical_address
lakeFS Catalog API->>Iceberg Client: object location ("s3://.../metadata.json")
Iceberg Client->>Object Store: GetObject
Object Store->>Iceberg Client: table data
Example: Writing an Iceberg Table¶
Here's a simplified example of what writing to an Iceberg table would look like:
sequenceDiagram
Actor Iceberg Client
participant lakeFS Catalog API
participant lakeFS
participant Object Store
Iceberg Client->>Object Store: table data
Iceberg Client->>lakeFS Catalog API: commit
lakeFS Catalog API->>lakeFS: create branch("branch-tx-uuid")
lakeFS Catalog API->>lakeFS: put('new table pointer')
lakeFS->>Object Store: PutObject("metadata.json")
lakeFS Catalog API->>lakeFS: merge('branch-tx-uuid', 'branch')
lakeFS Catalog API->>Iceberg Client: done
Related Resources¶
Further Reading
Deprecated: Iceberg HadoopCatalog¶
Warning
HadoopCatalog and other filesystem-based catalogs are currently not recommended by the Apache Iceberg community and come with several limitations around concurrency and tooling.
As such, the HadoopCatalog described in this section is now deprecated and will not receive further updates.
Setup
Use the following Maven dependency to install the lakeFS custom catalog:
Configuration
Set up the Spark SQL catalog:
.config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog") \
.config("spark.sql.catalog.lakefs.warehouse", f"lakefs://{repo_name}") \
.config("spark.sql.catalog.lakefs.cache-enabled", "false")
Configure the S3A Hadoop FileSystem with your lakeFS connection details. Note that these are your lakeFS endpoint and credentials, not your S3 ones.
.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io") \
.config("spark.hadoop.fs.s3a.access.key", "AKIAIO5FODNN7EXAMPLE") \
.config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY") \
.config("spark.hadoop.fs.s3a.path.style.access", "true")
spark-shell --conf spark.sql.catalog.lakefs="org.apache.iceberg.spark.SparkCatalog" \
--conf spark.sql.catalog.lakefs.catalog-impl="io.lakefs.iceberg.LakeFSCatalog" \
--conf spark.sql.catalog.lakefs.warehouse="lakefs://example-repo" \
--conf spark.sql.catalog.lakefs.cache-enabled="false" \
--conf spark.hadoop.fs.s3.impl="org.apache.hadoop.fs.s3a.S3AFileSystem" \
--conf spark.hadoop.fs.s3a.endpoint="https://example-org.us-east-1.lakefscloud.io" \
--conf spark.hadoop.fs.s3a.access.key="AKIAIO5FODNN7EXAMPLE" \
--conf spark.hadoop.fs.s3a.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
--conf spark.hadoop.fs.s3a.path.style.access="true"
Using Iceberg tables with HadoopCatalog
Create a table
To create a table on your main branch, use the following syntax:
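A sketch (the column types are assumptions; the database and table names match the INSERT statements below):

# create the db1 namespace and an Iceberg table on the main branch of the lakefs catalog
spark.sql("CREATE SCHEMA IF NOT EXISTS lakefs.main.db1")
spark.sql("CREATE TABLE lakefs.main.db1.table1 (id INT, data STRING) USING iceberg")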
Insert data into the table
INSERT INTO lakefs.main.db1.table1 VALUES (1, 'data1');
INSERT INTO lakefs.main.db1.table1 VALUES (2, 'data2');
Create a branch
We can now commit the creation of the table to the main branch:
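For example, with the lakefs Python SDK (the commit message is illustrative):

import lakefs

# commit the table creation and inserts on main
lakefs.repository('example-repo').branch('main').commit(
    message='Create db1.table1 with initial data'
)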
Then, create a branch:
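For example, creating a branch named dev with the lakefs Python SDK (the branch name is an assumption carried through the following snippets):

# branch off main to work on the table in isolation
dev = lakefs.repository('example-repo').branch('dev').create(source_reference='main')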
Make changes on the branch
We can now make changes on the branch:
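For example (the inserted values are illustrative), adding a row to the table on the dev branch:

# the branch name replaces "main" in the table identifier
spark.sql("INSERT INTO lakefs.dev.db1.table1 VALUES (3, 'data3')")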
Query the table
If we query the table on the branch, we will see the data we inserted; if we query the table on the main branch, we will not see the new changes:
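A sketch, continuing the assumed dev branch and insert above:

# on the dev branch the new row is visible
spark.sql("SELECT * FROM lakefs.dev.db1.table1").show()
# expected under the inserts above: (1, 'data1'), (2, 'data2'), (3, 'data3')

# on main the branch's change is not visible until it is merged
spark.sql("SELECT * FROM lakefs.main.db1.table1").show()
# expected: (1, 'data1'), (2, 'data2')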
Migrating an existing Iceberg table to the Hadoop Catalog
This is done through an incremental copy from the original table into lakeFS.
1. Create a new lakeFS repository:

lakectl repo create lakefs://example-repo <base storage path>

2. Initiate a Spark session that can interact with the source Iceberg table and the target lakeFS catalog. Here's an example of a Hadoop and S3 session and lakeFS catalog with per-bucket config:

SparkConf conf = new SparkConf();
conf.set("spark.hadoop.fs.s3a.path.style.access", "true");
// set hadoop on S3 config (source tables we want to copy) for spark
conf.set("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog");
conf.set("spark.sql.catalog.hadoop_prod.type", "hadoop");
conf.set("spark.sql.catalog.hadoop_prod.warehouse", "s3a://my-bucket/warehouse/hadoop/");
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions");
conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.access.key", "<AWS_ACCESS_KEY>");
conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.secret.key", "<AWS_SECRET_KEY>");
// set lakeFS config (target catalog and repository)
conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog");
conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog");
conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo");
conf.set("spark.hadoop.fs.s3a.bucket.example-repo.access.key", "<LAKEFS_ACCESS_KEY>");
conf.set("spark.hadoop.fs.s3a.bucket.example-repo.secret.key", "<LAKEFS_SECRET_KEY>");
conf.set("spark.hadoop.fs.s3a.bucket.example-repo.endpoint", "<LAKEFS_ENDPOINT>");

3. Create Schema in lakeFS and copy the data.

Example of copy with spark-sql:
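A hedged sketch of that copy (schema and table names are illustrative), expressed as spark.sql calls against the hadoop_prod source catalog and the lakefs target catalog configured above:

# create the target schema on the main branch of the lakeFS repository
spark.sql("CREATE SCHEMA IF NOT EXISTS lakefs.main.db1")

# copy the source table into lakeFS as a new Iceberg table
spark.sql(
    "CREATE TABLE IF NOT EXISTS lakefs.main.db1.table1 "
    "USING iceberg "
    "AS SELECT * FROM hadoop_prod.db1.table1"
)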