Multi-Storage Backend
lakeFS Enterprise
Multi-storage backend support is only available to licensed lakeFS Enterprise customers. Contact us to get started!
What is Multi-storage Backend Support?
lakeFS multi-storage backend support enables seamless data management across multiple storage systems — on-premises, across public clouds, or hybrid environments. This capability makes lakeFS a unified data management platform for all organizational data assets, which is especially critical in AI/ML environments that rely on diverse datasets stored in multiple locations.
With a multi-store setup, lakeFS can connect to and manage any combination of supported storage systems, including:
- AWS S3
- Azure Blob
- Google Cloud Storage
- other S3-compatible storage
- local storage
Multi-storage backend support is available from version v1.51.0 of lakeFS Enterprise.
Use Cases
- Distributed Data Management:
  - Eliminate data silos and enable seamless cross-cloud collaboration.
  - Maintain version control across different storage providers for consistency and reproducibility.
  - Ideal for AI/ML environments where datasets are distributed across multiple storage locations.
- Unified Data Access:
  - Access data across multiple storage backends using a single, consistent URI format.
- Centralized Access Control & Governance:
  - Access permissions and policies can be centrally managed across all connected storage systems using lakeFS RBAC.
  - Compliance and security controls remain consistent, regardless of where the data is stored.
Configuration
To configure your lakeFS server to connect to multiple storage backends, define them under the `blockstores` section in your server configuration. The `blockstores.stores` field is an array of storage backends, each with its own configuration. For a complete list of available options, refer to the server configuration reference.
Note: If you’re upgrading from a single-store lakeFS setup, refer to the upgrade guidelines to ensure a smooth transition.
Example Configurations
This example setup configures lakeFS to manage data across two separate MinIO instances:
```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "minio-prod"
      description: "Primary on-prem MinIO storage for production data"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio-prod.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "prod_access_key"
          secret_access_key: "prod_secret_key"
    - id: "minio-backup"
      description: "Backup MinIO storage for disaster recovery"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio-backup.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "backup_access_key"
          secret_access_key: "backup_secret_key"
```
This example setup configures lakeFS to manage data across two public cloud providers: AWS and Azure:
```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "s3-prod"
      description: "AWS S3 storage for production data"
      type: "s3"
      s3:
        region: "us-east-1"
    - id: "azure-analytics"
      description: "Azure Blob storage for analytics data"
      type: "azure"
      azure:
        storage_account: "analytics-account"
        storage_access_key: "EXAMPLE45551FSAsVVCXCF"
```
This hybrid setup configures lakeFS to manage data across both cloud and on-prem storage:
```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "s3-archive"
      description: "AWS S3 storage for long-term archival"
      type: "s3"
      s3:
        region: "us-west-2"
    - id: "minio-fast-access"
      description: "On-prem MinIO for high-performance workloads"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "minio_access_key"
          secret_access_key: "minio_secret_key"
```
Key Considerations
- Unique Blockstore IDs: Each storage backend must have a unique `id`.
- Persistence of Blockstore IDs: Once defined, an `id` must not change.
- S3 Authentication Handling:
  - All standard S3 authentication methods are supported.
  - Every blockstore needs to be authenticated, so make sure to configure a profile or static credentials for every storage of type `s3`. An S3 blockstore falls back to the default credentials chain when no credentials are configured, so you may be able to rely on the chain for one of the storages (see the sketch below).
Changing a storage ID is not supported and may result in unexpected behavior. Ensure IDs remain consistent once configured.
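For illustration, a minimal sketch of two `s3`-type stores, one relying on the default credentials chain and one using static credentials for an on-prem endpoint (IDs, endpoints, and keys are placeholders):

```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    # Relies on the default AWS credentials chain (environment, profile, or instance role)
    - id: "s3-default-chain"
      type: "s3"
      s3:
        region: "us-east-1"
    # Uses static credentials for an S3-compatible on-prem store
    - id: "minio-static"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "minio_access_key"
          secret_access_key: "minio_secret_key"
```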
Upgrading from a Single Storage Backend to Multiple Storage Backends
When upgrading from a single storage backend to a multi-storage setup, follow these guidelines:
- Use the new `blockstores` structure, replacing the existing `blockstore` configuration. Note that `blockstore` and `blockstores` configurations are mutually exclusive - lakeFS does not support both simultaneously.
- Define all previously available single-blockstore settings under their respective storage backends.
- The `signing.secret_key` is a required setting, global to all connected stores.
- Set `backward_compatible: true` for the existing storage backend (see the example below) to ensure:
  - Existing repositories continue to use the original storage backend.
  - Newly created repositories default to this backend unless explicitly assigned a different one, ensuring a non-breaking upgrade process.
- This setting is mandatory: lakeFS will not function if it is unset.
- Do not remove this setting as long as you need to support repositories created before the upgrade. If removed, lakeFS will fail to start because it will treat existing repositories as disconnected from any configured storage.
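As an illustration, a minimal sketch of an upgraded configuration, assuming the pre-upgrade setup used a single S3 blockstore in us-east-1, a new MinIO backend is being added, and `backward_compatible` is set at the store level as described above (IDs, endpoint, and credentials are placeholders):

```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    # The original blockstore; keeps pre-upgrade repositories working and
    # remains the default for new repositories created without a storage ID
    - id: "s3-original"
      backward_compatible: true
      type: "s3"
      s3:
        region: "us-east-1"
    # A newly added storage backend
    - id: "minio-new"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "minio_access_key"
          secret_access_key: "minio_secret_key"
```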
Adding or Removing a Storage Backend
To add a storage backend, update the server configuration with the new storage entry and restart the server.
To remove a storage backend:
- Delete all repositories associated with the storage backend (this removes only the repository definition in lakeFS, not the underlying data); see the example below.
- Remove the storage entry from the configuration.
- Restart the server.
lakeFS will fail to start if there are repositories defined on a removed storage. Ensure all necessary cleanup is completed before removing a storage backend.
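For example, a possible cleanup flow using lakectl before removing a backend (the repository name is a placeholder; repeat the delete for every repository defined on that backend):

```bash
# Identify the repositories defined on the backend being removed
lakectl repo list

# Delete each of them (this removes only the lakeFS repository definition)
lakectl repo delete lakefs://my-repo-on-removed-storage

# Then remove the store entry from the configuration and restart the server
```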
Listing Connected Storage Backends
The Get Config API endpoint returns a list of storage configurations. In multi-storage setups, this is the recommended method to list connected storage backends and view their details.
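For example, a sketch of calling it with curl, assuming the server listens on localhost:8000 and the config endpoint is exposed at `/api/v1/config` (credentials are placeholders):

```bash
# Returns the server configuration, including the connected storage backends
curl -s -u "$LAKEFS_ACCESS_KEY_ID:$LAKEFS_SECRET_ACCESS_KEY" \
  http://localhost:8000/api/v1/config
```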
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Blockstore ID conflicts | Duplicate `id` values in `stores` | Ensure each storage backend has a unique ID |
| Missing `backward_compatible` | Upgrading from single to multi-storage without setting the flag | Add `backward_compatible: true` for the existing storage |
| Unsupported configurations in OSS or unlicensed Enterprise accounts | Using multi-storage features in an unsupported setup | Contact us to start using the feature |
Migrating from Multiple Storage Backends to a Single Storage Backend
Once you upgrade to a multi-storage setup, you cannot simply revert by changing the configuration from `blockstores` back to `blockstore`. The internal repository metadata format changes to support multiple storage backends and is not backward compatible with the single-storage format. If you need to consolidate your data and revert from a multi-storage setup to a single storage backend, you’ll need to perform a full migration by following these steps:
Overview
The migration process involves:
- Dumping repository references from the multi-storage setup
- Deleting repositories in the multi-storage environment
- Configuring lakeFS with a single storage backend
- Restoring repositories to the single storage environment
Step-by-Step Guide
Use the `lakefs-refs.py` script; instructions on how to acquire it can be found in Backup and Restore.
1. Dump Repository References

   To dump the repository metadata:

   ```bash
   # Dump a single repository
   python lakefs-refs.py dump my-repository

   # Or dump all repositories
   python lakefs-refs.py dump --all
   ```

   This will create manifest files for each repository.

   Optionally, you can use the `--rm` flag to automatically delete repositories after a successful dump:

   ```bash
   # Dump and delete a single repository
   python lakefs-refs.py dump my-repository --rm

   # Or dump and delete all repositories
   python lakefs-refs.py dump --all --rm
   ```
2. Delete Source Repositories (if not using the `--rm` flag)

   If you didn’t use the `--rm` flag in step 1, you’ll need to delete the repositories manually. Note that deleting a repository only removes the repository record from lakeFS; it does not delete the actual data files or metadata from your storage.

   You can delete repositories through:

   a. The lakeFS UI
   b. lakectl (e.g. `lakectl repo delete lakefs://my-repository`)
) -
Configure lakeFS Single Storage
- Update your lakeFS configuration to use a single storage backend (using the
blockstore
section instead ofblockstores
). - Start or restart lakeFS after applying the new configuration
- Example of single storage backend configuration: (multi storage backend example can be found in the Configuration section)
... blockstore: type: s3 s3: region: us-east-1 ...
- Update your lakeFS configuration to use a single storage backend (using the
4. Copy Repository Data (if needed)

   If the repositories you want to restore were created on a storage system that lakeFS is no longer connected to:

   a. Copy data from the old storage location to the new one:

   ```bash
   # Example for S3
   aws s3 sync s3://old-bucket/path/to/storage-namespace s3://new-bucket/path/to/storage-namespace

   # Alternative: Using rclone for cross-provider transfers
   # rclone supports various storage providers (S3, Azure, GCS, etc.)
   rclone sync azure:old-container/path/to/storage-namespace aws:new-bucket/path/to/storage-namespace
   ```

   b. Update the manifest file created in step 1 (dump repository references) with the new storage namespace:

   ```json
   {
     "repository": {
       "name": "my-repository",
       "storage_namespace": "s3://new-bucket/path/to/repo", // ... update here ...
       "default_branch": "main",
       "storage_id": "storage-1"
     },
     "refs": {
       // ... existing refs data ...
     }
   }
   ```
5. Restore Repositories

   Note: If you copied the data to a new location in step 4, make sure to update the storage namespace in the manifest files before restoring.

   Use the `--ignore-storage-id` flag to ensure repositories are created without storage IDs in the single-storage environment:

   ```bash
   # Restore a single repository
   python lakefs-refs.py restore my-repository_manifest.json --ignore-storage-id

   # Or restore multiple repositories
   python lakefs-refs.py restore repo1_manifest.json repo2_manifest.json --ignore-storage-id
   ```
Important Notes
- Keep the manifest files safe as they contain repository metadata
- If using different storage backends, ensure proper access permissions to copy the data
- The `--commit` flag can be used if you want to ensure all changes are committed before dumping.
- Make sure the new storage backend has sufficient space for all repository data.
- A lakeFS instance configured with a single storage type will not start if repositories created on a multi-storage setup still exist.
Working with Repositories
After setting up lakeFS Enterprise to connect with multiple storage backends, this section explains how to use these connected storages when working with lakeFS.
With multiple storage backends configured, each lakeFS repository is linked to a specific storage backend. Together with the repository’s storage namespace, this defines the exact location in the underlying storage where the repository’s data is stored.
The choice of storage backend impacts the following lakeFS operations:
Creating a Repository
In a multi-storage setup, users must specify a storage ID when creating a repository. This can be done using the following methods:
- UI: Select a storage backend from the dropdown menu.

- lakectl: Use the `--storage-id` flag with the `repo create` command:

  ```bash
  lakectl repo create lakefs://my-repo s3://my-bucket --storage-id my-storage
  ```

  Note: The `--storage-id` flag is currently hidden in the CLI.

- API: Use the `storage_id` parameter in the Create Repository endpoint.

- Python SDK: Starting from version 0.9.0 of the High-level Python SDK, you can use `kwargs` to pass `storage_id` dynamically when calling the create repository method:

  ```python
  import lakefs

  repo = lakefs.Repository("example-repo").create(
      storage_namespace="s3://storage-bucket/repos/example-repo",
      storage_id="my-storage-id",
  )
  ```
Important notes:
- In multi-storage setups where a storage backend is marked as `backward_compatible: true`, repository creation requests without a storage ID will default to this storage.
- If no storage backend is marked as `backward_compatible`, repository creation requests without a storage ID will fail.
- Each repository is linked to a single backend and stores data within a single storage namespace on that backend.
Viewing Repository Details
To check which storage backend is associated with a repository:
- UI: The storage ID is displayed under “Storage” in the repository settings page.
- API: Use the List Repositories endpoint; its response includes the storage ID.
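For example, a sketch of listing repositories with curl, assuming the server listens on localhost:8000 (credentials are placeholders and exact response field names may vary by version):

```bash
# Each repository entry in the response includes its storage ID in multi-storage setups
curl -s -u "$LAKEFS_ACCESS_KEY_ID:$LAKEFS_SECRET_ACCESS_KEY" \
  http://localhost:8000/api/v1/repositories
```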
Importing Data into a Repository
Importing data into a repository is supported when the credentials used for the repository’s backing blockstore allow read and list access to the storage location.
Limitations
Supported storages
Multi-storage backend support has been validated on:
- Self-managed S3-compatible object storage (MinIO)
- Amazon S3
- Local storage
Note: Other storage backends may work but have not been officially tested. If you’re interested in exploring additional configurations, please contact us.
Unsupported clients
The following clients do not currently support working with multiple storage backends. However, we are actively working to bridge this gap: