Using lakeFS with the AWS Glue Data Catalog
Info: Available in lakeFS Enterprise.
Tip: This integration requires the lakeFS Iceberg REST Catalog to be enabled. Contact us to get started!
Overview
Using AWS Glue Catalog Federation, you can expose lakeFS-managed Apache Iceberg tables to any AWS service that reads from the Glue Data Catalog -- including Amazon Athena, Redshift Spectrum, and EMR.
The lakefs-glue CLI tool automates the creation of a federated Glue catalog that connects directly to the lakeFS Iceberg REST Catalog. Table metadata is discovered in real time through lakeFS -- no data copying or metadata syncing required.
How It Works
- A query engine (e.g. Athena) submits a SQL query referencing a table in the federated Glue Data Catalog.
- Glue connects to the lakeFS Iceberg REST Catalog via a REST API connection and retrieves table metadata.
- Lake Formation issues temporary, scoped S3 credentials to the query engine.
- The query engine reads Iceberg data files directly from S3.
Prerequisites
- A lakeFS instance with the Iceberg REST Catalog enabled.
- A lakeFS service account (access key and secret key).
- Python 3.11+ (for running the `lakefs-glue` CLI).
- AWS credentials with permissions to manage IAM, Glue, Lake Formation, and Secrets Manager.
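The prerequisites above can be checked from a shell before running the tool. The following is a minimal sketch and not part of lakefs-glue: the function names `version_at_least` and `preflight` are hypothetical, and the `LAKEFS_ACCESS_KEY_ID` and `LAKEFS_SECRET_ACCESS_KEY` environment variables are the ones used in the examples later on this page. It only confirms that AWS credentials resolve to some identity, not that they carry the IAM, Glue, Lake Formation, and Secrets Manager permissions listed above.

```shell
# version_at_least REQUIRED ACTUAL
# Succeeds when ACTUAL (major.minor[.patch]) is >= REQUIRED (major.minor).
version_at_least() {
  req_major=${1%%.*}; req_rest=${1#*.}; req_minor=${req_rest%%.*}
  act_major=${2%%.*}; act_rest=${2#*.}; act_minor=${act_rest%%.*}
  [ "$act_major" -gt "$req_major" ] ||
    { [ "$act_major" -eq "$req_major" ] && [ "$act_minor" -ge "$req_minor" ]; }
}

# preflight (hypothetical helper)
# Reports missing prerequisites on stderr; returns non-zero if any check fails.
preflight() {
  status=0

  # Python 3.11+ is needed to run the lakefs-glue CLI.
  py_version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null)
  if [ -z "$py_version" ] || ! version_at_least 3.11 "$py_version"; then
    echo "Python 3.11+ not found (detected: ${py_version:-none})" >&2
    status=1
  fi

  # lakeFS service account keys, read from the environment.
  if [ -z "${LAKEFS_ACCESS_KEY_ID:-}" ] || [ -z "${LAKEFS_SECRET_ACCESS_KEY:-}" ]; then
    echo "LAKEFS_ACCESS_KEY_ID / LAKEFS_SECRET_ACCESS_KEY are not set" >&2
    status=1
  fi

  # AWS credentials: only checks that an identity can be resolved.
  if ! aws sts get-caller-identity >/dev/null 2>&1; then
    echo "No usable AWS credentials found" >&2
    status=1
  fi

  return "$status"
}
```

Running `preflight` after pasting the functions prints one line per failed check and returns a non-zero status if anything is missing.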
Setting Up a Federated Catalog
Install the CLI
The easiest way to run the tool is with uv -- no install needed:
Create a Federated Catalog
Run the federate command to create a Glue federated catalog that points to a lakeFS repository and ref:
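As a rough sketch of such a call, the function below passes only the required options from the Configuration Options table, so `--lakefs-ref` and `--catalog-name` stay at their defaults. The wrapper name `federate_default` and the `LAKEFS_URL` environment variable are assumptions made for this illustration; the CLI itself is only known to accept the flags shown.

```shell
# federate_default REPO (hypothetical wrapper)
# Federates one repository using only the required options.
# LAKEFS_URL is an assumed variable holding the lakeFS server URL,
# for example https://my-org.us-east-1.lakefscloud.io
federate_default() {
  uvx lakefs-glue federate \
    --lakefs-url "$LAKEFS_URL" \
    --lakefs-repo "$1" \
    --lakefs-access-key-id "$LAKEFS_ACCESS_KEY_ID" \
    --lakefs-secret-access-key "$LAKEFS_SECRET_ACCESS_KEY"
}

# Example call, commented out so that pasting the snippet changes nothing in AWS:
# federate_default my-repo
```

Calling `federate_default my-repo` runs the command for the repository `my-repo`.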
This creates a federated catalog named lakefs-catalog (the default) pointing to the main branch.
Configuration Options
| Option | Required | Default | Description |
|---|---|---|---|
| `--lakefs-url` | Yes | | lakeFS server URL |
| `--lakefs-repo` | Yes | | Repository name |
| `--lakefs-ref` | No | `main` | Branch, tag, or commit ID to expose |
| `--lakefs-access-key-id` | Yes | | Service account access key |
| `--lakefs-secret-access-key` | Yes | | Service account secret key |
| `--catalog-name` | No | `lakefs-catalog` | Name for the Glue federated catalog |
| `--region` | No | `us-east-1` | AWS region |
| `--grant-to` | No | | IAM ARNs to grant catalog access to (repeatable) |
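To illustrate how the optional flags combine, here is a sketch that sets the ref, catalog name, and region explicitly and repeats `--grant-to` once per principal. The function name `federate_with_grants` and the `LAKEFS_URL` variable are hypothetical, and the ARN in the usage line is a placeholder.

```shell
# federate_with_grants REPO REF CATALOG REGION [ARN...]  (hypothetical wrapper)
federate_with_grants() {
  if [ "$#" -lt 4 ]; then
    echo "usage: federate_with_grants REPO REF CATALOG REGION [ARN...]" >&2
    return 2
  fi
  repo=$1; ref=$2; catalog=$3; region=$4
  shift 4

  # Rewrite the remaining arguments in place, so that
  #   ARN1 ARN2   becomes   --grant-to ARN1 --grant-to ARN2
  # (the for loop iterates over the original list while we append and shift).
  for arn in "$@"; do
    set -- "$@" --grant-to "$arn"
    shift
  done

  uvx lakefs-glue federate \
    --lakefs-url "$LAKEFS_URL" \
    --lakefs-repo "$repo" \
    --lakefs-ref "$ref" \
    --catalog-name "$catalog" \
    --region "$region" \
    --lakefs-access-key-id "$LAKEFS_ACCESS_KEY_ID" \
    --lakefs-secret-access-key "$LAKEFS_SECRET_ACCESS_KEY" \
    "$@"
}
```

For example, `federate_with_grants my-repo dev my-repo-dev us-east-1 arn:aws:iam::123456789012:role/analyst` would expose the `dev` branch as `my-repo-dev` and grant access to a single role. Passing no ARNs omits `--grant-to` entirely.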
What Gets Created
The federate command creates the following AWS resources:
- Secrets Manager secret (`<catalog-name>-secret`) -- stores lakeFS credentials for OAuth2 authentication.
- IAM role (`<catalog-name>-GlueConnectionRole`) -- assumed by Glue and Lake Formation.
- Glue Connection (`<catalog-name>-connection`) -- the REST API bridge to lakeFS.
- Lake Formation resource -- enables S3 credential vending.
- Glue Catalog (`<catalog-name>`) -- the federated catalog visible in Athena and other AWS services.
- Lake Formation grants -- permissions for any principals specified with `--grant-to`.
The command is idempotent: running it again with different parameters updates the existing resources.
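One way to confirm that the named resources exist after a run is to look them up with the AWS CLI. The sketch below is an assumption-laden illustration rather than a feature of lakefs-glue: the function name `check_resources` is hypothetical, the resource names are derived from the naming pattern listed above, and the default region mirrors the `--region` default. It covers only the secret, the IAM role, and the Glue connection, not the Lake Formation registration or grants.

```shell
# check_resources CATALOG_NAME [REGION]  (hypothetical helper)
# Looks up the resources whose names follow the <catalog-name>-... pattern.
# Returns non-zero if any lookup fails.
check_resources() {
  name=$1
  region=${2:-us-east-1}   # same default as the --region option
  missing=0

  aws secretsmanager describe-secret --region "$region" \
    --secret-id "${name}-secret" >/dev/null 2>&1 ||
    { echo "not found: secret ${name}-secret" >&2; missing=1; }

  # IAM is a global service, so no region is passed here.
  aws iam get-role --role-name "${name}-GlueConnectionRole" >/dev/null 2>&1 ||
    { echo "not found: role ${name}-GlueConnectionRole" >&2; missing=1; }

  aws glue get-connection --region "$region" \
    --name "${name}-connection" >/dev/null 2>&1 ||
    { echo "not found: connection ${name}-connection" >&2; missing=1; }

  return "$missing"
}
```

For the default catalog name this would be called as `check_resources lakefs-catalog`. A failed lookup can also mean that the caller lacks read permission on the resource, so treat a "not found" line as a prompt to investigate rather than as proof that the resource is absent.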
Working with Multiple Branches and Refs
Each federated catalog points to a single lakeFS ref. To query multiple branches, tags, or commits, create a separate catalog for each:
# Main branch
uvx lakefs-glue federate \
  --lakefs-repo my-repo \
  --lakefs-ref main \
  --catalog-name my-repo-main \
  --lakefs-url https://my-org.us-east-1.lakefscloud.io \
  --lakefs-access-key-id "$LAKEFS_ACCESS_KEY_ID" \
  --lakefs-secret-access-key "$LAKEFS_SECRET_ACCESS_KEY"

# Development branch
uvx lakefs-glue federate \
  --lakefs-repo my-repo \
  --lakefs-ref dev \
  --catalog-name my-repo-dev \
  --lakefs-url https://my-org.us-east-1.lakefscloud.io \
  --lakefs-access-key-id "$LAKEFS_ACCESS_KEY_ID" \
  --lakefs-secret-access-key "$LAKEFS_SECRET_ACCESS_KEY"

# A specific tag
uvx lakefs-glue federate \
  --lakefs-repo my-repo \
  --lakefs-ref v1.0 \
  --catalog-name my-repo-v1 \
  --lakefs-url https://my-org.us-east-1.lakefscloud.io \
  --lakefs-access-key-id "$LAKEFS_ACCESS_KEY_ID" \
  --lakefs-secret-access-key "$LAKEFS_SECRET_ACCESS_KEY"

# The same commands without the uvx prefix, for when lakefs-glue is already on your PATH.
# Main branch
lakefs-glue federate \
  --lakefs-repo my-repo \
  --lakefs-ref main \
  --catalog-name my-repo-main \
  --lakefs-url https://my-org.us-east-1.lakefscloud.io \
  --lakefs-access-key-id "$LAKEFS_ACCESS_KEY_ID" \
  --lakefs-secret-access-key "$LAKEFS_SECRET_ACCESS_KEY"

# Development branch
lakefs-glue federate \
  --lakefs-repo my-repo \
  --lakefs-ref dev \
  --catalog-name my-repo-dev \
  --lakefs-url https://my-org.us-east-1.lakefscloud.io \
  --lakefs-access-key-id "$LAKEFS_ACCESS_KEY_ID" \
  --lakefs-secret-access-key "$LAKEFS_SECRET_ACCESS_KEY"

# A specific tag
lakefs-glue federate \
  --lakefs-repo my-repo \
  --lakefs-ref v1.0 \
  --catalog-name my-repo-v1 \
  --lakefs-url https://my-org.us-east-1.lakefscloud.io \
  --lakefs-access-key-id "$LAKEFS_ACCESS_KEY_ID" \
  --lakefs-secret-access-key "$LAKEFS_SECRET_ACCESS_KEY"
Each catalog appears independently in Athena and Lake Formation.
Querying the Federated Catalog
Once set up, you can query your lakeFS tables from any AWS service that integrates with Glue Data Catalog. For a detailed guide with query examples, see Using lakeFS with AWS Glue & Amazon Athena.
Removing Federated Catalogs
To clean up federated catalogs and their associated AWS resources:
Limitations
- Read-only: AWS Glue Catalog Federation only supports read queries. `INSERT`, `CREATE TABLE`, and other write operations are not supported.
- Single ref per catalog: Each federated catalog points to one lakeFS ref. Create multiple catalogs to query multiple branches or tags.
- Flat namespaces only: AWS Glue Catalog Federation supports only flat `catalog.namespace.table` structures -- nested namespaces are not supported.