Using lakeFS with Apache Iceberg¶

lakeFS Iceberg REST Catalog¶

Info

Available in lakeFS Enterprise

Tip

lakeFS Iceberg REST Catalog is currently in private preview for lakeFS Enterprise customers. Contact us to get started!

What is lakeFS Iceberg REST Catalog?¶

lakeFS Iceberg REST Catalog allow you to use lakeFS as a spec-compliant Apache Iceberg REST catalog, allowing Iceberg clients to manage and access tables using a standard REST API.

Using lakeFS Iceberg REST Catalog, you can use lakeFS a drop-in replacement for other Iceberg catalogs like AWS Glue, Nessie, Hive Metastore - or the lakeFS HadoopCatalog (see below)

With lakeFS Iceberg REST Catalog, you can:

Manage Iceberg tables with full version control capabilities.
Use standard Iceberg clients and tools without modification.
Leverage lakeFS's branching and merging features for managing table's lifecycle.
Maintain data consistency across different environments.

Use Cases¶

Version-Controlled Data Development:
- Create feature branches for table schema changes or data migrations
- Test modifications in isolation, across multiple tables
- Merge changes safely with conflict detection
Multi-Environment Management:
- Use branches to represent different environments (dev, staging, prod)
- Promote changes between environments through merges, with automated testing
- Maintain consistent table schemas across environments
Collaborative Data Development:
- Multiple teams can work on different table features simultaneously
- Maintain data quality through pre-merge validations
- Collaborate using pull requests on changes to data and schema
Manage and Govern Access to data:
- Use the detailed built-in commit log capturing who, what and how data is changed
- Manage access using fine grained access control to users and groups using RBAC policies
- Rollback changes atomically and safely to reduce time-to-recover and increase system stability

Configuration¶

The Iceberg REST catalog API is exposed at /iceberg/api in your lakeFS server.

To use it:

Enable the feature (contact us for details).
Configure your Iceberg clients to use the lakeFS REST catalog endpoint.
Use your lakeFS access key and secret for authentication.

Catalog Initialization Example (using `pyiceberg`)¶

from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(name = "my_catalog", **{
    'prefix': 'lakefs',
    'uri': f'{lakefs_endpoint}/iceberg/api',
    'oauth2-server-uri': f'{lakefs_endpoint}/iceberg/api/v1/oauth/tokens',
    'credential': f'{lakefs_client_key}:{lakefs_client_secret}',
})

Example Client code¶

Python (PyIceberg)TrinoSpark

import lakefs
from pyiceberg.catalog import load_catalog

# Initialize the catalog
catalog = RestCatalog(name = "my_catalog", **{
    'prefix': 'lakefs',
    'uri': 'https://lakefs.example.com/iceberg/api',
    'oauth2-server-uri': 'https://lakefs.example.com/iceberg/api/iceberg/api/v1/oauth/tokens',
    'credential': f'AKIAlakefs12345EXAMPLE:abc/lakefs/1234567bPxRfiCYEXAMPLEKEY',
})

# List namespaces in a branch
catalog.list_namespaces(('repo', 'main'))

# Query a table
catalog.list_tables('repo.main.inventory')
table = catalog.load_table('repo.main.inventory.books')
arrow_df = table.scan().to_arrow()

-- List tables in the iceberg catalog
USE "repo.main.inventory"; -- <repository>.<branch or reference>.<namespace>
SHOW TABLES;

-- Query a table
SELECT * FROM books LIMIT 100;

-- Switch to a different branch
USE "repo.new_branch.inventory";
SELECT * FROM books;

// Configure Spark to use the lakeFS REST catalog
spark.sql("USE my_repo.main.inventory")

// List available tables
spark.sql("SHOW TABLES").show()

// Query data with branch isolation
spark.sql("SELECT * FROM books").show()

// Switch to a feature branch
spark.sql("USE my_repo.new_branch.inventory")
spark.sql("SELECT * FROM books").show()

Namespaces and Tables¶

Namespace Operations¶

The Iceberg Catalog supports Iceberg namespace operations:

Create namespaces
List namespaces
Drop namespaces
List tables within namespaces

Namespace Usage¶

Namespaces in the Iceberg Catalog follow the pattern "<repository>.<branch>.<namespace>(.<namespace>...)" where:

<repository> must be a valid lakeFS repository name.
<branch> must be a valid lakeFS branch name.
<namespace> components can be nested using unit separator (e.g., inventory.books).

Examples:

my-repo.main.inventory
my-repo.feature-branch.inventory.books

The repository and branch components must already exist in lakeFS before using them in the Iceberg catalog.

Relative Namespace support¶

Some Apache Iceberg clients do not support nested namespaces.

To support those, the lakeFS REST Catalog allows specifying relative namespaces: passing a partial namespace as part of the catalog URL endpoint (commonly, <repository>.<branch>).

By doing so, all namespaces passed by the user will be relative to the namespaces passed in the URL.

Example: DuckDB

DuckDB allows limited nesting in the form of <database>.<schema>. When using the Iceberg REST Catalog integration, the database is replaced by the name given to the catalog. We can use relative namespaces to allow scoping the catalog connecting to a specific repository and branch:

LOAD iceberg;
LOAD httpfs;

CREATE SECRET lakefs_credentials (
    TYPE ICEBERG,
    CLIENT_ID '...',
    CLIENT_SECRET '...',
    OAUTH2_SERVER_URI 'https://lakefs.example.com/iceberg/api/v1/oauth/tokens'
);

ATTACH 'lakefs' AS main_branch (
    TYPE iceberg,
    SECRET lakefs_credentials,
    -- notice the "/relative_to/.../" part:
    ENDPOINT 'https://lakefs.example.com/iceberg/relative_to/my-repo.main/api'
);

USE main_branch.inventory;
SELECT * FROM books;

See the DuckDB documentation for a full reference on how to setup an Iceberg catalog integration.

Namespace Restrictions¶

Repository and branch names must follow lakeFS naming conventions.
Namespace components cannot contain special characters except dots (.) for nesting.
The total namespace path length must be less than 255 characters.
Namespaces are case-sensitive.
Empty namespace components are not allowed.

Table Operations¶

The Iceberg Catalog supports all standard Iceberg table operations:

Create tables with schemas and partitioning.
Update table schemas and partitioning.
Commit changes to tables.
Delete tables.
List tables in namespaces.

Version Control Features¶

The Iceberg Catalog integrates with lakeFS's version control system. Table and namespace changes made through the Iceberg API are staged on the target lakeFS reference, and may be committed to lakeFS to record these staged changes. This provides a complete history of table modifications and enables lakeFS branching and merging workflows.

Catalog Changes in lakeFS¶

Iceberg changes (create/drop/commit to table, create/drop namespace) are reflected as staging changes on the relevant lakeFS (table/namespace) object. These changes aren't committed to lakeFS, and should be explicitly committed by using lakeFS tools.

Note that several Iceberg operations can be grouped across multiple tables into a single lakeFS commit.

Branching and Merging¶

Create a new branch to work on table changes:

# Create a lakeFS branch using lakeFS Python SDK
branch = lakefs.repository('repo').branch('new_branch').create(source_reference='main')

# The table is now accessible in the new branch
new_table = catalog.load_table(f'repo.{branch.id}.inventory.books')

Merge changes between branches:

# Merge the branch using lakeFS Python SDK
branch.merge_into('main')

# Changes are now visible in main
main_table = catalog.load_table('repo.main.inventory.books')

Info

lakeFS handles table changes as file operations during merges.

This means that when merging branches with table changes, lakeFS treats the table metadata files as regular files.

No special merge logic is applied to handle conflicting table changes, and if there are conflicting changes to the same table in different branches, the merge will fail with a conflict that needs to be resolved manually.

Authentication¶

lakeFS provides an OAuth2 token endpoint at /catalog/iceberg/v1/oauth/tokens that clients need to configure. To authenticate, clients must provide their lakeFS access key and secret in the format access_key:secret as the credential.

Authorization¶

The authorization requirements are managed at the lakeFS level, meaning:

Users need appropriate lakeFS permissions to access repositories and branches
Table operations require lakeFS permissions on the underlying objects
The same lakeFS RBAC policies apply to Iceberg catalog operations

Limitations¶

Table Maintenance:
- See Table Maintenance section for details
Advanced Features:
- Views (all view operations are unsupported)
- Transactional DML (stage-create)
- Server-side query planning
- Table renaming
- Updating table's location (using Commit)
- Table statistics (set-statistics and remove-statistics operations are currently a no-op)
lakeFS Iceberg REST Catalog is currently tested to work with Amazon S3 and Google Cloud Storage. Other storage backends, such as Azure or Local storage are currently not supported, but will be in future releases.
Currently only Iceberg v2 table format is supported

Table Maintenance¶

Compact Data Files¶

In the lakeFS catalog, data integrity is maintained after data file compaction operations (such as RewriteDataFiles), since data files are not deleted by such operations (only new snapshot and manifest files are created).

However, to avoid unnecessary merge conflicts, we recommend marking compaction operations using snapshotProperty(), so that lakeFS can automatically resolve conflicts when merging branches with compaction commits.

Requirements:

Use the Spark Java API (not SQL procedures) with Iceberg v1.5 or newer
Mark compaction operations with the following property and value using snapshotProperty():

SparkActions.get(spark)
    .rewriteDataFiles(table)
    .snapshotProperty("lakefs.compaction.operation-id", "rewrite-data-files")
    .execute();

Conflict Resolution:

When merging branches, lakeFS automatically resolves conflicts if both branches have compaction commits for the same table and at most one branch has non-compaction data changes. This allows safe merging of compacted branches without losing work.

Info

Expired or deleted snapshots will very likely lead to merge conflicts, as lakeFS cannot determine which snapshots to keep.

Tip

Frequently merge compacted branches to minimize merge conflicts.

Tip

Data changes are ignored in non-"main" Iceberg branches ("Refs"), so it's advised to avoid branching in Iceberg when branching in lakeFS.

For disabling the automatic conflict resolution, set the iceberg.ignore_compaction_commits configuration flag to false (it defaults to true).

Unsupported Operations¶

The following table maintenance operations are not supported in the current version:

Danger

To prevent data loss, clients should disable their own cleanup operations by:

Disabling orphan file deletion.
Setting remove-dangling-deletes to false when rewriting.
Disabling snapshot expiration.
Setting a very high value for min-snapshots-to-keep parameter.

Roadmap¶

The following features are planned for future releases:

Table Import:
- Support for importing existing Iceberg tables from other catalogs
- Bulk import capabilities for large-scale migrations
Azure Storage Support
Advanced Features:
- Views API support
- Table transactions
Advanced versioning capabilities
- merge non-conflicting table updates

How it works¶

Under the hood, the lakeFS Iceberg REST Catalog keeps track of each table's metadata file. This is typically referred to as the table pointer.

This pointer is stored inside the repository's storage namespace.

When a request is made, the catalog would examine the table's fully qualified namespace: <repository>.<reference>.<namespace>.<table_name> to read that special pointer file from the given reference specified, and returns the underlying object store location of the metadata file to the client. When a table is created or updated, lakeFS would make sure to generate a new metadata file inside the storage namespace, and register that metadata file as the current pointer for the requested branch. These changes are staged in lakeFS, and may be committed to the branch.

This approach leverages Iceberg's existing metadata and the immutability of its snapshots: a commit in lakeFS captures a metadata file, which in turn captures manifest lists, manifest files and all related data files.

Besides simply avoiding "double booking" where both Iceberg and lakeFS would need to keep track of which files belong to which version, it also greatly improves the scalability and compatibility of the catalog with the existing Iceberg tool ecosystem.

Example: Reading an Iceberg Table¶

Here's a simplified example of what reading from an Iceberg table would look like:

sequenceDiagram
    Actor Iceberg Client
    participant lakeFS Catalog API
    participant lakeFS
    participant Object Store

    Iceberg Client->>lakeFS Catalog API: get table metadata("repo.branch.table")
    lakeFS Catalog API->>lakeFS: get('repo', 'branch', 'table')
    lakeFS->>lakeFS Catalog API: physical_address
    lakeFS Catalog API->>Iceberg Client: object location ("s3://.../metadata.json")
    Iceberg Client->>Object Store: GetObject
    Object Store->>Iceberg Client: table data

Example: Writing an Iceberg Table¶

Here's a simplified example of what writing to an Iceberg table would look like:

sequenceDiagram
    Actor Iceberg Client
    participant lakeFS Catalog API
    participant lakeFS
    participant Object Store
    Iceberg Client->>Object Store: write table data
    Iceberg Client->>lakeFS Catalog API: Iceberg commit
    lakeFS Catalog API->>lakeFS: stage('new table pointer')
    lakeFS->>Object Store: PutObject("metdata.json")
    lakeFS Catalog API->>Iceberg Client: Iceberg commit done
    Iceberg Client->>lakeFS: lakeFS commit('repo','branch', message)

Catalog Sync¶

Catalog Sync allows you to synchronize Iceberg tables between the lakeFS catalog and external catalogs such as AWS Glue Data Catalog or other Iceberg REST-compatible catalogs. This enables workflows where some users or tools need to access data through other external catalogs.

Use Cases¶

Collaboration with External Tools: Share data with users who rely on tools that only support specific catalogs. For example, data engineers can work with tables in lakeFS while data analysts query the same data through AWS Glue using Athena, all while maintaining isolation and version control.

sequenceDiagram
    actor Alice (Data Engineer)
    actor Bob (Data Analyst)
    participant lakeFS
    participant AWS Glue
    Alice (Data Engineer)->>lakeFS: Update tables on branch 'dev'
    Alice (Data Engineer)->>lakeFS: Push tables to Glue
    lakeFS->>AWS Glue: Register tables in 'dev' database
    Alice (Data Engineer)->>Bob (Data Analyst): Please review: glue/dev
    Bob (Data Analyst)->>AWS Glue: SELECT * FROM dev.table (via Athena)
    Bob (Data Analyst)->>Alice (Data Engineer): Approved!
    Alice (Data Engineer)->>lakeFS: Merge 'dev' into 'main'

Isolated Pipelines: Run data pipelines using tools that require external catalogs while maintaining isolation through lakeFS branches. Create a branch, push tables to an external catalog, run your pipeline, pull the changes back, and merge into main.

sequenceDiagram
    participant Orchestrator
    participant lakeFS
    participant External Catalog
    participant Pipeline Tool
    Orchestrator->>lakeFS: Create branch 'etl-2024-01-15'
    Orchestrator->>lakeFS: Push tables to external catalog
    lakeFS->>External Catalog: Register tables
    Orchestrator->>Pipeline Tool: Run ETL pipeline
    Pipeline Tool->>External Catalog: Read/write tables
    Orchestrator->>lakeFS: Pull updated tables
    lakeFS->>External Catalog: Read updated metadata
    Orchestrator->>lakeFS: Merge 'etl-2024-01-15' into 'main'

Configuration¶

Remote catalogs are configured in your lakeFS configuration file. Each remote catalog requires a unique identifier and type-specific connection properties.

Note

In case you need help configuring a remote catalog, contact support.

AWS Glue Data Catalog¶

Configure an AWS Glue catalog by specifying the region and AWS credentials:

iceberg_catalog:
  remotes:
    - id: aws_glue_us_east_1
      type: glue
      glue:
        region: us-east-1
        access_key_id: <your-glue-key>
        secret_access_key: <your-glue-secret>

Iceberg REST Catalog¶

Configure a generic Iceberg REST catalog with basic authentication:

iceberg_catalog:
  remotes:
    - id: remote_catalog
      type: rest
      rest:
        uri: https://catalog.example.com/iceberg/api
        credential: <client-id>:<client-secret>

Or with OAuth2 client credentials flow:

iceberg_catalog:
  remotes:
    - id: remote_catalog
      type: rest
      rest:
        uri: https://catalog.example.com/iceberg/api
        credential: <client-id>:<client-secret>
        oauth_server_uri: https://auth.example.com/oauth/tokens
        oauth_scope: catalog:read catalog:write

Push to remote¶

Push operations register a table from lakeFS into a remote catalog. The table's metadata and data remain in lakeFS-managed storage, which are used as the pushed table's location.

API Endpoint: POST /iceberg/remotes/{catalog-id}/push

Parameters: - source: The lakeFS table location (repository, branch/reference, namespace, table name) - destination: The remote catalog location (namespace, table name) - force_update: (optional, default: false) Override the table if it already exists in the remote catalog - create_namespace: (optional, default: false) Create the namespace in the remote catalog if it doesn't exist

Example:

curl -X POST "https://lakefs.example.com/iceberg/remotes/aws_glue_us_east_1/push" \
  -H "Authorization: Bearer $LAKEFS_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "repository_id": "my-repo",
      "reference_id": "main",
      "namespace": ["default", "features"],
      "table": "image_properties"
    },
    "destination": {
      "namespace": ["main_features"],
      "table": "image_properties"
    },
    "create_namespace": true,
    "force_update": true
  }'

This example pushes the table my-repo.main.default.features.image_properties from lakeFS to the AWS Glue catalog as main_features.image_properties. It creates the remote namespace if needed (since create_namespace: true), and overwrites any existing table or possible recent updates committed to it (since force_update: true).

Pull from remote¶

Pull operations update a lakeFS table with changes from a remote catalog. This is useful after external tools have modified a table previously pushed from lakeFS.

API Endpoint: POST /iceberg/remotes/{catalog-id}/pull

Parameters: - source: The remote catalog location (namespace, table name) - destination: The lakeFS table location (repository, branch/reference, namespace, table name) - force_update: (optional, default: false) Override the table in lakeFS if metadata conflicts exist - create_namespace: (optional, default: false) Create the namespace in lakeFS if it doesn't exist

Example:

curl -X POST "https://lakefs.example.com/iceberg/remotes/aws_glue_us_east_1/pull" \
  -H "Authorization: Bearer $LAKEFS_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "namespace": ["main_features"],
      "table": "image_properties"
    },
    "destination": {
      "repository_id": "my-repo",
      "reference_id": "main",
      "namespace": ["default", "features"],
      "table": "image_properties"
    }
    "create_namespace": true,
    "force_update": true
  }'

This example pulls changes from the Glue table main_features.image_properties back into the lakeFS table my-repo.main.default.features.image_properties. It creates the namespace in lakeFS if needed (since create_namespace: true), and overwrites any existing table or possible recent updates committed to it (since force_update: true).

Important Notes¶

Storage Location: Pulled tables' metadata.json file must reside in a storage location to which lakeFS has read access to, or the pull operation will fail.
Atomicity: Push and pull operations are not atomic. If an operation fails partway through, manual intervention may be required (contact support in such a case if needed).
Authentication: Ensure the credentials configured for remote catalogs have appropriate permissions:
- For AWS Glue: glue:CreateTable, glue:UpdateTable, glue:GetTable, glue:CreateDatabase (if create_namespace is used)
- For REST catalogs: Appropriate OAuth scopes for table and namespace operations
Namespace Format: Namespaces are represented as arrays of strings to support nested namespaces (e.g., ["accounting", "tax"] represents accounting.tax).

Deprecated: Iceberg HadoopCatalog¶

Warning

HadoopCatalog and other filesystem-based catalogs are currently not recommended by the Apache Iceberg community and come with several limitations around concurrency and tooling.

As such, the HadoopCatalog described in this section is now deprecated and will not receive further updates

Setup

MavenPySpark

Use the following Maven dependency to install the lakeFS custom catalog:

<dependency>
<groupId>io.lakefs</groupId>
<artifactId>lakefs-iceberg</artifactId>
<version>0.1.4</version>
</dependency>

Include the lakefs-iceberg jar in your package list along with Iceberg. For example:

.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,io.lakefs:lakefs-iceberg:0.1.4")

Configuration

PySparkSpark Shell

Set up the Spark SQL catalog:

.config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog") \
.config("spark.sql.catalog.lakefs.warehouse", f"lakefs://{repo_name}") \ 
.config("spark.sql.catalog.lakefs.cache-enabled", "false")

Configure the S3A Hadoop FileSystem with your lakeFS connection details. Note that these are your lakeFS endpoint and credentials, not your S3 ones.

.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io") \
.config("spark.hadoop.fs.s3a.access.key", "AKIAIO5FODNN7EXAMPLE") \
.config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY") \
.config("spark.hadoop.fs.s3a.path.style.access", "true")

spark-shell --conf spark.sql.catalog.lakefs="org.apache.iceberg.spark.SparkCatalog" \
    --conf spark.sql.catalog.lakefs.catalog-impl="io.lakefs.iceberg.LakeFSCatalog" \
    --conf spark.sql.catalog.lakefs.warehouse="lakefs://example-repo" \
    --conf spark.sql.catalog.lakefs.cache-enabled="false" \
    --conf spark.hadoop.fs.s3.impl="org.apache.hadoop.fs.s3a.S3AFileSystem" \
    --conf spark.hadoop.fs.s3a.endpoint="https://example-org.us-east-1.lakefscloud.io" \
    --conf spark.hadoop.fs.s3a.access.key="AKIAIO5FODNN7EXAMPLE" \
    --conf spark.hadoop.fs.s3a.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
    --conf spark.hadoop.fs.s3a.path.style.access="true"

Using Iceberg tables with HadoopCatalog

Create a table

To create a table on your main branch, use the following syntax:

CREATE TABLE lakefs.main.db1.table1 (id int, data string);

Insert data into the table

INSERT INTO lakefs.main.db1.table1 VALUES (1, 'data1');
INSERT INTO lakefs.main.db1.table1 VALUES (2, 'data2');

Create a branch

We can now commit the creation of the table to the main branch:

lakectl commit lakefs://example-repo/main -m "my first iceberg commit"

Then, create a branch:

lakectl branch create lakefs://example-repo/dev -s lakefs://example-repo/main

Make changes on the branch

We can now make changes on the branch:

INSERT INTO lakefs.dev.db1.table1 VALUES (3, 'data3');

Query the table

If we query the table on the branch, we will see the data we inserted:

SELECT * FROM lakefs.dev.db1.table1;

Results in:

+----+------+
| id | data |
+----+------+
| 1  | data1|
| 2  | data2|
| 3  | data3|
+----+------+

However, if we query the table on the main branch, we will not see the new changes:

SELECT * FROM lakefs.main.db1.table1;

Results in:

+----+------+
| id | data |
+----+------+
| 1  | data1|
| 2  | data2|
+----+------+

Migrating an existing Iceberg table to the Hadoop Catalog

This is done through an incremental copy from the original table into lakeFS.

Create a new lakeFS repository lakectl repo create lakefs://example-repo <base storage path>

Initiate a spark session that can interact with the source iceberg table and the target lakeFS catalog. Here's an example of Hadoop and S3 session and lakeFS catalog with per-bucket config:

SparkConf conf = new SparkConf();
conf.set("spark.hadoop.fs.s3a.path.style.access", "true");

// set hadoop on S3 config (source tables we want to copy) for spark
conf.set("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog");
conf.set("spark.sql.catalog.hadoop_prod.type", "hadoop");
conf.set("spark.sql.catalog.hadoop_prod.warehouse", "s3a://my-bucket/warehouse/hadoop/");
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions");
conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.access.key", "<AWS_ACCESS_KEY>");
conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.secret.key", "<AWS_SECRET_KEY>");

// set lakeFS config (target catalog and repository)
conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog");
conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog");
conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo");
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions");
conf.set("spark.hadoop.fs.s3a.bucket.example-repo.access.key", "<LAKEFS_ACCESS_KEY>");
conf.set("spark.hadoop.fs.s3a.bucket.example-repo.secret.key", "<LAKEFS_SECRET_KEY>");
conf.set("spark.hadoop.fs.s3a.bucket.example-repo.endpoint"  , "<LAKEFS_ENDPOINT>");

3. Create Schema in lakeFS and copy the data

Example of copy with spark-sql:

-- Create Iceberg Schema in lakeFS
CREATE SCHEMA IF NOT EXISTS <lakefs-catalog>.<branch>.<db>
-- Create new iceberg table in lakeFS from the source table (pre-lakeFS)
CREATE TABLE IF NOT EXISTS <lakefs-catalog>.<branch>.<db> USING iceberg AS SELECT * FROM <iceberg-original-table>

Using lakeFS with Apache Iceberg¶

lakeFS Iceberg REST Catalog¶

What is lakeFS Iceberg REST Catalog?¶

Use Cases¶

Configuration¶

Catalog Initialization Example (using pyiceberg)¶

Example Client code¶

Namespaces and Tables¶

Namespace Operations¶

Namespace Usage¶

Relative Namespace support¶

Namespace Restrictions¶

Table Operations¶

Version Control Features¶

Catalog Changes in lakeFS¶

Branching and Merging¶

Authentication¶

Authorization¶

Limitations¶

Table Maintenance¶

Compact Data Files¶

Unsupported Operations¶

Roadmap¶

How it works¶

Example: Reading an Iceberg Table¶

Example: Writing an Iceberg Table¶

Related Resources¶

Catalog Sync¶

Use Cases¶

Configuration¶

AWS Glue Data Catalog¶

Iceberg REST Catalog¶

Push to remote¶

Pull from remote¶

Important Notes¶

Deprecated: Iceberg HadoopCatalog¶

Setup

Configuration

Using Iceberg tables with HadoopCatalog

Create a table

Insert data into the table

Create a branch

Make changes on the branch

Query the table

Migrating an existing Iceberg table to the Hadoop Catalog

Catalog Initialization Example (using `pyiceberg`)¶