Backup and Restore Repository
This section explains how to back up and restore a lakeFS repository for different use cases:
- Disaster Recovery: you want to back up the repository regularly so you can restore it after a disaster. You also need to back up the repository’s storage namespace to another, preferably geographically separate, location.
- Migrate Repository: you want to migrate a repository from one lakeFS environment to another.
- Clone Repository: you want to clone a repository.
Refer to the Python Sample Notebooks to back up, migrate, or clone a lakeFS repository.
Commit Changes
The backup process doesn’t include uncommitted data, so commit any staged writes that you want in the backup before running it. This step is optional.
You can commit the changes manually in the lakeFS UI or programmatically. This Python example loops through all branches in a repository and commits any uncommitted data. It might take a long time if the repository has many branches:
import lakefs

repo = lakefs.Repository("example-repo")
for branch in repo.branches():
    # One uncommitted entry is enough to know that this branch needs a commit.
    for diff in repo.branch(branch.id).uncommitted():
        repo.branch(branch.id).commit(message='Committed changes to backup the repository')
        break
Backup Repository
Dump Metadata
Dump the repository’s metadata (refs) by using the lakeFS API or CLI.
- Example code to dump metadata by using the lakeFS Python SDK (this creates the _lakefs/refs_manifest.json file in the repository’s storage namespace):
lakefs_sdk_client.internal_api.dump_refs("example-repo")
- Example commands to dump metadata by using the lakeFS CLI and upload it to the repository’s S3 storage:
lakectl refs-dump lakefs://example-repo > refs_manifest.json
aws s3 cp refs_manifest.json s3://source-bucket-name/example-repo/_lakefs/refs_manifest.json
- Example commands to dump metadata by using the lakeFS CLI and upload it to the repository’s Azure Blob storage:
lakectl refs-dump lakefs://example-repo > refs_manifest.json
az storage blob upload --file refs_manifest.json --container-name sourceContainer --name example-repo/_lakefs/refs_manifest.json --account-name source-storage-account-name --account-key <source-storage-account-key>
Shut down the lakeFS services immediately after dumping the metadata so that nobody can change the source repository.
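The upload commands above all write the manifest to the same place relative to the storage namespace. As a minimal sketch (the helper names `manifest_location` and `split_object_uri` are hypothetical and not part of lakeFS), the location can be derived in Python before calling an upload tool:

```python
from urllib.parse import urlparse

# Relative path of the manifest inside the storage namespace, as used in this section.
REFS_MANIFEST_SUFFIX = "_lakefs/refs_manifest.json"


def manifest_location(storage_namespace: str) -> str:
    """Return the full URI of the refs manifest for a storage namespace."""
    return storage_namespace.rstrip("/") + "/" + REFS_MANIFEST_SUFFIX


def split_object_uri(uri: str):
    """Split a URI such as s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


location = manifest_location("s3://source-bucket-name/example-repo")
print(location)
print(split_object_uri(location))
```

The bucket and key pair is the form most object storage clients expect; the URI form matches the `aws s3 cp` target shown earlier.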
Copy Data to Backup Storage Location
Copy the repository’s storage namespace to another, preferably geographically separate, location. The copy command depends on the type of object storage and the tool that you use.
- Example S3 command:
aws s3 sync s3://source-bucket-name/example-repo s3://target-bucket-name/example-repo
- Example Azure azcopy command:
azcopy copy 'https://source-storage-account-name.blob.core.windows.net/sourceContainer/example-repo/*?<source_container_SAS_token>' 'https://target-storage-account-name.blob.core.windows.net/targetContainer/example-repo?<target_container_SAS_token>' --recursive
You can restart the lakeFS services after copying the data to the backup storage location.
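Before restarting the services, it can be worth confirming that the copy is complete. The sketch below is one possible check, not a lakeFS feature: it assumes you have already collected a listing of object keys and sizes for the source and the target (for example, from your storage tool's listing output), and the function name `diff_listings` is hypothetical. Comparing sizes catches missing and truncated objects but is weaker than comparing checksums.

```python
from typing import Dict, List, Tuple


def diff_listings(source: Dict[str, int], target: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Compare two {object_key: size_in_bytes} listings.

    Returns (missing, mismatched): keys absent from the target, and keys
    present in both listings whose sizes differ.
    """
    missing = sorted(key for key in source if key not in target)
    mismatched = sorted(
        key for key in source if key in target and source[key] != target[key]
    )
    return missing, mismatched


# Illustrative listings; real ones would come from the object store.
source_listing = {
    "example-repo/_lakefs/refs_manifest.json": 120,
    "example-repo/data/a": 10,
    "example-repo/data/b": 20,
}
target_listing = {
    "example-repo/_lakefs/refs_manifest.json": 120,
    "example-repo/data/a": 10,
}

missing, mismatched = diff_listings(source_listing, target_listing)
print("missing:", missing)
print("size mismatch:", mismatched)
```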
Restore Repository
Create a new Bare Repository
Create a bare lakeFS repository with a new name if you want to clone the repository, or use the same repository name if you want to migrate or restore the repository.
- Python example to create a bare lakeFS repository using S3 storage:
lakefs.Repository("target-example-repo").create(bare=True, storage_namespace="s3://target-bucket-name/example-repo", default_branch="same-default-branch-as-in-source-repo")
- Python example to create a bare lakeFS repository using Azure storage:
lakefs.Repository("target-example-repo").create(bare=True, storage_namespace="https://target-storage-account-name.blob.core.windows.net/targetContainer/example-repo", default_branch="same-default-branch-as-in-source-repo")
- lakeFS CLI command to create a bare lakeFS repository using S3 storage:
lakectl repo create-bare lakefs://target-example-repo s3://target-bucket-name/example-repo --default-branch "same-default-branch-as-in-source-repo"
- lakeFS CLI command to create a bare lakeFS repository using Azure storage:
lakectl repo create-bare lakefs://target-example-repo https://target-storage-account-name.blob.core.windows.net/targetContainer/example-repo --default-branch "same-default-branch-as-in-source-repo"
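When the restore is scripted, the CLI call above can be assembled as an argument list and passed to `subprocess.run`. This is a minimal sketch: the function name `create_bare_command` is hypothetical, and it uses only the `lakectl repo create-bare` arguments shown in this section.

```python
import shlex
from typing import List


def create_bare_command(target_repo: str, storage_namespace: str, default_branch: str) -> List[str]:
    """Build the lakectl argument list that creates a bare repository."""
    return [
        "lakectl", "repo", "create-bare",
        f"lakefs://{target_repo}",
        storage_namespace,
        "--default-branch", default_branch,
    ]


# "main" stands in for the default branch name of the source repository.
cmd = create_bare_command(
    "target-example-repo",
    "s3://target-bucket-name/example-repo",
    "main",
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems if a repository or branch name contains unusual characters.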
Restore Metadata to new Repository
Run restore_refs to load back all commits, tags and branches.
- Python example to restore metadata to the new repository. First, download the metadata file (refs_manifest.json) created by the metadata dump process, using the command that matches your object storage (S3 or Azure):
aws s3 cp s3://target-bucket-name/example-repo/_lakefs/refs_manifest.json .
azcopy copy 'https://target-storage-account-name.blob.core.windows.net/targetContainer/example-repo/_lakefs/refs_manifest.json?<target_container_SAS_token>' .
Then read the refs_manifest.json file and restore the metadata to the new repository:
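import json
# Assumes target_lakefs_sdk_client (a client for the target lakeFS) and target_repo_name are already defined.
with open('./refs_manifest.json') as file:
    refs_manifest_json = json.load(file)
print(refs_manifest_json)
target_lakefs_sdk_client.internal_api.restore_refs(target_repo_name, refs_manifest_json)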
- lakeFS CLI command to restore metadata to the new repository using S3 storage:
aws s3 cp s3://target-bucket-name/example-repo/_lakefs/refs_manifest.json - | lakectl refs-restore lakefs://target-example-repo --manifest -
- lakeFS CLI command to restore metadata to the new repository using Azure storage:
az storage blob download --container-name targetContainer --name example-repo/_lakefs/refs_manifest.json --account-name target-storage-account-name --account-key <target-storage-account-key> | lakectl refs-restore lakefs://target-example-repo --manifest -
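A truncated or empty manifest download is easier to diagnose before it is sent to the restore call. The sketch below is a sanity check under a stated assumption: that the manifest is a JSON object carrying the three meta-range ID fields named in `EXPECTED_KEYS`. Verify those names against a manifest produced by your lakeFS version. The function name `missing_manifest_keys` is hypothetical.

```python
import json
from typing import List, Sequence

# Assumed field names of the refs manifest; confirm against your own dump.
EXPECTED_KEYS = (
    "commits_meta_range_id",
    "tags_meta_range_id",
    "branches_meta_range_id",
)


def missing_manifest_keys(manifest_text: str, expected_keys: Sequence[str] = EXPECTED_KEYS) -> List[str]:
    """Return the expected keys that are absent or empty in the manifest."""
    manifest = json.loads(manifest_text)  # raises ValueError on invalid JSON
    if not isinstance(manifest, dict):
        raise ValueError("manifest must be a JSON object")
    return [key for key in expected_keys if not manifest.get(key)]


# Illustrative manifest text with made-up IDs.
sample = '{"commits_meta_range_id": "c1", "tags_meta_range_id": "t1", "branches_meta_range_id": ""}'
print(missing_manifest_keys(sample))
```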
Note: If you run backups regularly, test the restore process periodically to make sure that you can restore the repository after a disaster.
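One way to make such a restore test concrete is to compare the head commit of every branch in the source repository with the restored one. The sketch below assumes you have already collected a mapping of branch name to head commit ID from each environment. How you collect it is up to you, and the function name `compare_heads` is hypothetical.

```python
from typing import Dict, List


def compare_heads(source_heads: Dict[str, str], restored_heads: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare {branch_name: head_commit_id} mappings of two repositories."""
    common = source_heads.keys() & restored_heads.keys()
    return {
        # Branches that exist in the source but not in the restored repository.
        "missing": sorted(source_heads.keys() - restored_heads.keys()),
        # Branches that exist only in the restored repository.
        "unexpected": sorted(restored_heads.keys() - source_heads.keys()),
        # Branches present in both whose head commits differ.
        "different": sorted(b for b in common if source_heads[b] != restored_heads[b]),
    }


# Illustrative data with made-up commit IDs.
report = compare_heads(
    {"main": "abc123", "dev": "def456"},
    {"main": "abc123"},
)
print(report)
restore_ok = not any(report.values())
print("restore verified:", restore_ok)
```

An empty report for all three categories indicates that the restored refs match the source at the time of the dump. It does not verify the underlying data objects.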
Python Helper Script for Backup and Restore
For more streamlined repository backup and restore operations, you can use the lakefs-refs.py script available in the lakeFS repository.
Overview
The lakefs-refs.py script automates the backup and restore procedures described in this document. It handles all the steps needed to dump and restore lakeFS repository references, which makes repository migration and backup simpler and less error-prone.
Prerequisites
Install the required dependencies:
pip install -r requirements.txt
Usage Examples
Dump Repository References
The script provides an easy way to dump repository metadata:
python lakefs-refs.py dump <repository-name> [--all] [--commit] [--rm] [--endpoint-url <url>] [--access-key-id <key>] [--secret-access-key <secret>]
Options:
- <repository-name>: Name of the specific repository to dump
- --all: Dump all repositories instead of a specific one
- --commit: Commit any uncommitted changes before dumping
- --rm: Delete repository definition after successful dump
- Authentication options (see below)
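For scheduled backups, the dump invocation can be assembled in Python and run with `subprocess.run`. This is a minimal sketch with a hypothetical function name (`dump_args`). It uses only the options listed above and assumes, based on the description of `--all`, that a repository name and `--all` are alternatives to each other.

```python
from typing import List, Optional


def dump_args(repository: Optional[str] = None, all_repos: bool = False,
              commit: bool = False, rm: bool = False) -> List[str]:
    """Build the argument list for `python lakefs-refs.py dump ...`."""
    # Assumption: exactly one of a repository name or --all is given.
    if bool(repository) == bool(all_repos):
        raise ValueError("pass either a repository name or all_repos=True")
    args = ["python", "lakefs-refs.py", "dump"]
    args.append("--all" if all_repos else repository)
    if commit:
        args.append("--commit")
    if rm:
        args.append("--rm")
    return args


print(dump_args("example-repo", commit=True))
# To execute it: subprocess.run(dump_args("example-repo", commit=True), check=True)
```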
Restore Repository References
For restoring repository metadata from manifest files:
python lakefs-refs.py restore <manifest-file> [<manifest-file2> ...] [--ignore-storage-id] [--endpoint-url <url>] [--access-key-id <key>] [--secret-access-key <secret>]
Options:
- <manifest-file>: One or more manifest files to restore
- --ignore-storage-id: Create repository without storage_id (useful when migrating to a different storage backend)
- Authentication options (see below)
Authentication
The script uses the same authentication method as lakectl, supporting authentication via command line parameters, environment variables (LAKECTL_*), or the ~/.lakectl.yaml configuration file.
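The following sketch illustrates how a setting could be resolved from these three sources. It is not the script's actual implementation. The precedence (command line, then environment, then configuration file) is an assumption, and so is the flat `config` dictionary that stands in for values read from ~/.lakectl.yaml. The environment variable names follow the lakectl naming pattern and should be checked against the lakectl documentation.

```python
import os
from typing import Mapping, Optional

# Assumed environment variable names, following the LAKECTL_* pattern.
ENV_NAMES = {
    "endpoint_url": "LAKECTL_SERVER_ENDPOINT_URL",
    "access_key_id": "LAKECTL_CREDENTIALS_ACCESS_KEY_ID",
    "secret_access_key": "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY",
}


def resolve_setting(name: str, cli_value: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None,
                    config: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Resolve one setting: CLI value first, then environment, then config."""
    if cli_value:
        return cli_value
    env = os.environ if env is None else env
    env_value = env.get(ENV_NAMES[name])
    if env_value:
        return env_value
    return (config or {}).get(name)


# Illustrative values only.
fake_env = {"LAKECTL_SERVER_ENDPOINT_URL": "http://env-host:8000"}
fake_config = {"endpoint_url": "http://config-host:8000", "access_key_id": "from-config"}

print(resolve_setting("endpoint_url", env=fake_env, config=fake_config))
print(resolve_setting("access_key_id", env=fake_env, config=fake_config))
```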