The simplest way to migrate away from lakeFS is to copy data from a lakeFS repository to an S3 bucket (or any other object store).
If, for some reason, lakeFS is not accessible, you can still migrate data to S3 using treeverse-distcp, assuming the underlying S3 bucket is intact. Here's how to do it:
Create a Copy Manifest - this file describes the source and destination for every object we want to copy. It is a mapping between lakeFS’ internal storage addressing and the paths of the objects as we’d expect to see them in S3.
To generate a manifest, connect to the PostgreSQL instance used by lakeFS and run the following command:
```shell
psql \
  --var "repository_name=repo1" \
  --var "branch_name=master" \
  --var "dst_bucket_name=bucket1" \
  postgres < create-extraction-manifest.sql > manifest.csv
```
You can download the `create-extraction-manifest.sql` script from the lakeFS GitHub repository.
**Note:** This manifest is also useful for recovery - it will allow you to restore service in case the PostgreSQL database becomes inaccessible. For safety, you can automate the creation of this manifest to run daily.
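The daily automation suggested above can be sketched as a cron entry. All paths, the database connection string, and the bucket names below are assumptions - adjust them to your environment:

```
# Hypothetical crontab entry: regenerate the extraction manifest every day
# at 02:00 and upload it to S3 (paths and names are examples only).
0 2 * * * psql --var "repository_name=repo1" --var "branch_name=master" \
  --var "dst_bucket_name=bucket1" postgres \
  < /opt/lakefs/create-extraction-manifest.sql > /tmp/manifest.csv \
  && aws s3 cp /tmp/manifest.csv s3://my-bucket/path/to/manifest.csv
```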
Copy the manifest to S3. Once copied, make a note of its ETag - we'll need it to run the copy batch job:

```shell
aws s3 cp /path/to/manifest.csv s3://my-bucket/path/to/manifest.csv
aws s3api head-object --bucket my-bucket --key path/to/manifest.csv | jq -r .ETag  # Or look for ETag in the output
```
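One detail worth knowing: `head-object` returns the ETag as a quoted string, and `jq -r` leaves the inner quote characters in place. A small sketch of stripping them before use (the ETag value below is just an example):

```shell
# The raw ETag comes back wrapped in literal double quotes;
# remove them before passing the value to the batch job script.
RAW_ETAG='"d41d8cd98f00b204e9800998ecf8427e"'   # example value only
ETAG=$(printf '%s' "$RAW_ETAG" | tr -d '"')
echo "$ETAG"
```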
- Once we have a manifest, let's define an S3 batch job that will copy all files for us. To do this, let's start by creating an IAM role called `lakeFSExportJobRole`, and grant it permissions as described in "Granting permissions for Batch Operations".
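Creating the role from the CLI can be sketched as follows. The trust policy is the standard one for S3 Batch Operations (`batchoperations.s3.amazonaws.com` is the service principal AWS documents); the `create-role` call is shown commented out since it requires valid AWS credentials:

```shell
# Trust policy that lets S3 Batch Operations assume the role.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "batchoperations.s3.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role (run with valid AWS credentials):
# aws iam create-role --role-name lakeFSExportJobRole \
#     --assume-role-policy-document file://trust-policy.json
```

Permissions for the copy itself are then attached as described in the AWS guide referenced above.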
Once we have an IAM role, let's install the copy Lambda function provided by treeverse-distcp.
Make a note of the Lambda function ARN – this is required for running an S3 Batch Job.
Take note of your account ID - this is required for running an S3 Batch Job:
```shell
aws sts get-caller-identity | jq -r .Account
```
Dispatch a copy job using the `run_copy.py` script:

```shell
run_copy.py \
  --account-id "123456789" \
  --csv-path "s3://my-bucket/path/to/manifest" \
  --csv-etag "..." \
  --report-path "s3://another-bucket/prefix/for/reports" \
  --lambda-handler-arn "arn:lambda:..."
```
- You will get a job number. Now go to the AWS S3 Batch Operations Console, switch to the region of your bucket, and confirm execution of that job.
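If you prefer the CLI to the console, the same confirmation and monitoring can be done with `aws s3control` - a sketch under the assumption that the job ID is the number printed by `run_copy.py`; the commands are commented out since they require valid AWS credentials:

```shell
ACCOUNT_ID="123456789"   # your AWS account ID (see `aws sts get-caller-identity`)
JOB_ID="<job-id>"        # placeholder: the job number printed by run_copy.py

# Confirm (activate) the batch job, moving it out of the awaiting-confirmation state:
# aws s3control update-job-status --account-id "$ACCOUNT_ID" \
#     --job-id "$JOB_ID" --requested-job-status Ready

# Check job progress and final status:
# aws s3control describe-job --account-id "$ACCOUNT_ID" --job-id "$JOB_ID" \
#     | jq -r .Job.Status
```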