Multi-Storage Backend
lakeFS Enterprise
Multi-storage backend support is only available to licensed lakeFS Enterprise customers. Contact us to get started!
What is Multi-storage Backend Support?
lakeFS multi-storage backend support enables seamless data management across multiple storage systems, whether on-premises, in the public cloud, or in hybrid environments. This capability makes lakeFS a unified data management platform for all organizational data assets, which is especially critical in AI/ML environments that rely on diverse datasets stored in multiple locations.
With a multi-store setup, lakeFS can connect to and manage any combination of supported storage systems, including:
- AWS S3
- Azure Blob Storage
- Google Cloud Storage
- Other S3-compatible object storage (such as MinIO)
- Local storage
Multi-storage backend support is available from lakeFS Enterprise version v1.51.0.
Use Cases
- Distributed Data Management:
  - Eliminate data silos and enable seamless cross-cloud collaboration.
  - Maintain version control across different storage providers for consistency and reproducibility.
  - Ideal for AI/ML environments where datasets are distributed across multiple storage locations.
- Unified Data Access:
  - Access data across multiple storage backends using a single, consistent URI format.
- Centralized Access Control & Governance:
  - Access permissions and policies can be centrally managed across all connected storage systems using lakeFS RBAC.
  - Compliance and security controls remain consistent, regardless of where the data is stored.
Configuration
To configure your lakeFS server to connect to multiple storage backends, define them under the `blockstores` section in your server configuration. The `blockstores.stores` field is an array of storage backends, each with its own configuration. For a complete list of available options, refer to the server configuration reference.
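The overall shape of the configuration is sketched below. This is a minimal illustration only; the store ID, description, and nested fields are placeholders, and the configuration reference plus the examples that follow are the authoritative source for available options:

```yaml
blockstores:
  # Signing secret shared by all connected stores (required)
  signing:
    secret_key: "some-secret"
  stores:
    # Each entry defines one backend; its ID must be unique and must never change
    - id: "example-store"
      description: "Optional human-readable description"
      type: "s3"   # the storage type, e.g. s3, azure, gs, local
      s3:
        region: "us-east-1"
```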
Note: If you’re upgrading from a single-store lakeFS setup, refer to the upgrade guidelines to ensure a smooth transition.
Example Configurations
This example setup configures lakeFS to manage data across two separate MinIO instances:
```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "minio-prod"
      description: "Primary on-prem MinIO storage for production data"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio-prod.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "prod_access_key"
          secret_access_key: "prod_secret_key"
    - id: "minio-backup"
      description: "Backup MinIO storage for disaster recovery"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio-backup.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "backup_access_key"
          secret_access_key: "backup_secret_key"
```
This example setup configures lakeFS to manage data across two public cloud providers, AWS and Azure:
```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "s3-prod"
      description: "AWS S3 storage for production data"
      type: "s3"
      s3:
        region: "us-east-1"
    - id: "azure-analytics"
      description: "Azure Blob storage for analytics data"
      type: "azure"
      azure:
        storage_account: "analytics-account"
        storage_access_key: "EXAMPLE45551FSAsVVCXCF"
```
This hybrid setup allows lakeFS to manage data across both cloud and on-prem storage:
```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "s3-archive"
      description: "AWS S3 storage for long-term archival"
      type: "s3"
      s3:
        region: "us-west-2"
    - id: "minio-fast-access"
      description: "On-prem MinIO for high-performance workloads"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "minio_access_key"
          secret_access_key: "minio_secret_key"
```
Key Considerations
- Unique Blockstore IDs: Each storage backend must have a unique `id`.
- Persistence of Blockstore IDs: Once defined, an `id` must not change.
- S3 Authentication Handling:
  - All standard S3 authentication methods are supported.
  - Every blockstore must be authenticated, so configure a profile or static credentials for each storage backend of type `s3`. An S3 store without explicit credentials falls back to the default credentials chain, so you may be able to rely on that for one storage backend (see the sketch below).
Changing a storage ID is not supported and may result in unexpected behavior. Ensure IDs remain consistent once configured.
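The following sketch illustrates the S3 authentication point above. It is illustrative only (regions and keys are placeholders): the first store is configured with static credentials, while the second omits the `credentials` block and therefore authenticates via the default AWS credentials chain.

```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "s3-static"
      type: "s3"
      s3:
        region: "us-east-1"
        credentials:                          # static credentials for this store
          access_key_id: "example_access_key"
          secret_access_key: "example_secret_key"
    - id: "s3-default-chain"
      type: "s3"
      s3:
        region: "eu-west-1"
        # No credentials block: this store authenticates via the default
        # AWS credentials chain (environment variables, shared profile,
        # or instance role)
```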
Upgrading from a Single Storage Backend to Multiple Storage Backends
When upgrading from a single storage backend to a multi-storage setup, follow these guidelines:
- Use the new `blockstores` structure, replacing the existing `blockstore` configuration. Note that the `blockstore` and `blockstores` configurations are mutually exclusive; lakeFS does not support both simultaneously.
- Define all previously available single-blockstore settings under their respective storage backends.
- The `signing.secret_key` is a required setting, global to all connected stores.
- Set `backward_compatible: true` for the existing storage backend (see the sketch after this list) to ensure:
  - Existing repositories continue to use the original storage backend.
  - Newly created repositories default to this backend unless explicitly assigned a different one, ensuring a non-breaking upgrade process.
  - This setting is mandatory; lakeFS will not function if it is unset.
  - Do not remove this setting as long as you need to support repositories created before the upgrade. If removed, lakeFS will fail to start because it will treat existing repositories as disconnected from any configured storage.
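As an illustration, consider upgrading from a single-store configuration like this (a hedged sketch; the store ID `legacy-s3` and field values are illustrative, and the configuration reference remains the authoritative schema):

```yaml
# Before: single-store configuration
blockstore:
  type: "s3"
  s3:
    region: "us-east-1"
```

The equivalent multi-store configuration would look like:

```yaml
# After: multi-store configuration
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "legacy-s3"              # unique ID; must never change once set
      type: "s3"
      backward_compatible: true    # existing repositories keep using this store
      s3:
        region: "us-east-1"
```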
Adding or Removing a Storage Backend
To add a storage backend, update the server configuration with the new storage entry and restart the server.
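For example, extending the hybrid example configuration shown earlier with a third store is a matter of appending a new entry to the `stores` array (a sketch; the `minio-staging` entry and its credentials are illustrative placeholders):

```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "s3-archive"             # existing entries remain unchanged
      description: "AWS S3 storage for long-term archival"
      type: "s3"
      s3:
        region: "us-west-2"
    - id: "minio-fast-access"
      description: "On-prem MinIO for high-performance workloads"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "minio_access_key"
          secret_access_key: "minio_secret_key"
    - id: "minio-staging"          # new entry; restart the server to apply
      description: "On-prem MinIO for staging workloads"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio-staging.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "staging_access_key"
          secret_access_key: "staging_secret_key"
```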
To remove a storage backend:
- Delete all repositories associated with the storage backend.
- Remove the storage entry from the configuration.
- Restart the server.
lakeFS will fail to start if there are repositories defined on a removed storage. Ensure all necessary cleanup is completed before removing a storage backend.
Listing Connected Storage Backends
The Get Config API endpoint returns a list of storage configurations. In multi-storage setups, this is the recommended method to list connected storage backends and view their details.
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Blockstore ID conflicts | Duplicate `id` values in `stores` | Ensure each storage backend has a unique ID |
| Missing `backward_compatible` | Upgrading from single to multi-storage without setting the flag | Add `backward_compatible: true` for the existing storage |
| Unsupported configurations in OSS or unlicensed Enterprise accounts | Using multi-storage features in an unsupported setup | Contact us to start using the feature |
Working with Repositories
This section explains how to use the connected storage systems when working with lakeFS, once lakeFS Enterprise has been set up to connect with multiple storage backends.
With multiple storage backends configured, each lakeFS repository is linked to a specific storage backend. Together with the repository’s storage namespace, this defines the exact location in the underlying storage where the repository’s data is stored.
The choice of storage backend impacts the following lakeFS operations:
Creating a Repository
In a multi-storage setup, users must specify a storage ID when creating a repository. This can be done using the following methods:
- UI: Select a storage backend from the dropdown menu.
- CLI: Use the `--storage-id` flag with the `repo create` command:

  ```shell
  lakectl repo create lakefs://my-repo s3://my-bucket --storage-id my-storage
  ```

  Note: The `--storage-id` flag is currently hidden in the CLI.
- API: Use the `storage_id` parameter in the Create Repository endpoint.
- Python SDK: Starting from version 0.9.0 of the High-level Python SDK, you can use `kwargs` to pass `storage_id` dynamically when calling the create repository method:

  ```python
  import lakefs

  # Create a repository whose data lives on the backend identified by storage_id
  repo = lakefs.Repository("example-repo").create(
      storage_namespace="s3://storage-bucket/repos/example-repo",
      storage_id="my-storage-id",
  )
  ```
Important notes:
- In multi-storage setups where a storage backend is marked as `backward_compatible: true`, repository creation requests without a storage ID will default to this storage.
- If no storage backend is marked as `backward_compatible`, repository creation requests without a storage ID will fail.
- Each repository is linked to a single backend and stores data within a single storage namespace on that backend.
Viewing Repository Details
To check which storage backend is associated with a repository:
- In the UI, the storage ID is displayed under “Storage” on the repository settings page.
- Use the List Repositories endpoint; its response includes the storage ID.
Importing Data into a Repository
Importing data into a repository is supported when the credentials used for the repository’s backing blockstore allow read and list access to the storage location.
Limitations
Supported storage systems
Multi-storage backend support has been validated on:
- Self-managed S3-compatible object storage (MinIO)
- Amazon S3
- Local storage
Note: Other storage backends may work but have not been officially tested. If you’re interested in exploring additional configurations, please contact us.
Unsupported clients
The following clients do not currently support working with multiple storage backends. However, we are actively working to bridge this gap: