
Migrating away from lakeFS

Table of contents

  1. Copying data from a lakeFS repository to an S3 bucket
  2. Using treeverse-distcp

Copying data from a lakeFS repository to an S3 bucket

The simplest way to migrate away from lakeFS is to copy data from a lakeFS repository to an S3 bucket (or any other object store).

For smaller repositories, this can be done using the AWS CLI or rclone. For larger repositories, running DistCp with lakeFS as the source is also an option.
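
    For example, here is a minimal sketch using rclone, assuming you have configured a "lakefs" remote that points at your lakeFS S3 gateway endpoint and an "s3" remote for the destination bucket (the repository "repo1", branch "master", and bucket "my-bucket" below are placeholders):

    # Copy everything under the branch to a prefix in the destination bucket
    rclone sync lakefs:repo1/master s3:my-bucket/repo1-export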

Using treeverse-distcp

If, for some reason, lakeFS is not accessible, we can still migrate data to S3 using treeverse-distcp, assuming the underlying S3 bucket is intact. Here’s how to do it:

  1. Create a Copy Manifest. This file describes the source and destination for every object we want to copy: a mapping between lakeFS’ internal storage addressing and the paths where we’d expect to see the objects in S3.

    To generate a manifest, connect to the PostgreSQL instance used by lakeFS and run the following command:

    psql \
      --var "repository_name=repo1" \
      --var "branch_name=master" \
      --var "dst_bucket_name=bucket1" \
      postgres < create-extraction-manifest.sql > manifest.csv
    

    You can download the create-extraction-manifest.sql script from the lakeFS GitHub repository.

    Note: This manifest is also useful for recovery. It allows you to restore access to your data if the PostgreSQL database becomes unavailable. For safety, you can automate the creation of this manifest to run daily, for example with a scheduled job as sketched below.
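
    A hypothetical cron entry for this, assuming psql and the AWS CLI are on cron’s PATH and that the script location, output path, and bucket below are placeholders:

    # Regenerate the extraction manifest every day at 02:00 and upload it to S3
    0 2 * * * psql --var "repository_name=repo1" --var "branch_name=master" --var "dst_bucket_name=bucket1" postgres < /opt/lakefs/create-extraction-manifest.sql > /tmp/manifest.csv && aws s3 cp /tmp/manifest.csv s3://my-bucket/path/to/manifest.csv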

  2. Copy the manifest to S3. Once copied, take note of its ETag; we’ll need it to run the copy batch job:

    aws s3 cp /path/to/manifest.csv s3://my-bucket/path/to/manifest.csv
    aws s3api head-object --bucket my-bucket --key path/to/manifest.csv | jq -r .ETag # Or look for ETag in the output
    
  3. Once we have a manifest, let’s define an S3 batch job that will copy all the files for us. To do this, start by creating an IAM role called lakeFSExportJobRole and granting it permissions as described in “Granting permissions for Batch Operations”.
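
    A minimal sketch of creating this role with the AWS CLI, using a trust policy that allows S3 Batch Operations to assume it (the permissions policy from the linked guide still needs to be attached separately):

    # Create the role with a trust policy for S3 Batch Operations
    aws iam create-role \
      --role-name lakeFSExportJobRole \
      --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Principal": {"Service": "batchoperations.s3.amazonaws.com"},
          "Action": "sts:AssumeRole"
        }]
      }'
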
  4. Once we have an IAM role, install the treeverse-distcp Lambda function.

    Make a note of the Lambda function ARN – this is required for running an S3 Batch Job.
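
    Assuming the function was deployed under the name treeverse-distcp (adjust if you used a different name), its ARN can be retrieved with the AWS CLI, for example:

    # Print only the function ARN
    aws lambda get-function \
      --function-name treeverse-distcp \
      --query 'Configuration.FunctionArn' \
      --output text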

  5. Take note of your account ID - this is required for running an S3 Batch Job:

    aws sts get-caller-identity | jq -r .Account
    
  6. Dispatch a copy job using the run_copy.py script:

    run_copy.py \
      --account-id "123456789" \
      --csv-path "s3://my-bucket/path/to/manifest.csv" \
      --csv-etag "..." \
      --report-path "s3://another-bucket/prefix/for/reports" \
      --lambda-handler-arn "arn:aws:lambda:..."
    
  7. You will get a job number. Now go to the AWS S3 Batch Operations Console, switch to the region of your bucket, and confirm execution of that job.
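
    Alternatively, if you prefer the CLI over the console, here is a sketch of confirming and monitoring the job with aws s3control (the job ID, account ID, and region below are placeholders):

    # Move the job from "Awaiting your confirmation" to Ready so it starts running
    aws s3control update-job-status \
      --account-id 123456789 \
      --job-id "<job-id>" \
      --requested-job-status Ready \
      --region us-east-1

    # Check progress and final status
    aws s3control describe-job \
      --account-id 123456789 \
      --job-id "<job-id>" \
      --region us-east-1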