Import data into lakeFS

Zero-copy import

Importing using the lakeFS UI

Prerequisites

lakeFS must have permissions to list the objects in the source object store, and the source must be in the same region as your destination bucket.
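For example, when both lakeFS and the source are on AWS S3, a minimal policy granting these list and read permissions might look like the following sketch (the bucket name source-bucket is a placeholder for your source bucket):

   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "ListAndReadSourceBucket",
         "Effect": "Allow",
         "Action": [
           "s3:ListBucket",
           "s3:GetObject"
         ],
         "Resource": [
           "arn:aws:s3:::source-bucket",
           "arn:aws:s3:::source-bucket/*"
         ]
       }
     ]
   }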

lakeFS supports two ways to ingest objects from the object store without copying the data:

  1. Importing using the lakeFS UI - A UI dialog that triggers an import to a designated import branch and creates a commit from all imported objects.
  2. Importing using the lakectl CLI - Use the lakectl CLI to create uncommitted objects in a branch. The CLI makes sequential calls to the lakeFS server.

Using the import wizard

Clicking the Import button from any branch will open the following dialog:

Import dialog example configured with S3

On the first import to the selected branch, lakeFS creates an import branch named _<branch_name>_imported. lakeFS imports all objects from the Source URI into the import branch, under the given prefix.

The UI updates periodically with the number of objects imported so far. The total duration depends on the number of objects to be imported, at a rate of roughly a few thousand objects per second.

Once the import is completed, you can merge the changes from the import branch to the source branch.
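If you prefer to perform this step from the command line, a merge of the import branch into its source branch could look like the following sketch (repository and branch names are placeholders, using the import-branch naming described above):

lakectl merge \
  lakefs://my-repo/_my-branch_imported \
  lakefs://my-repo/my-branch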

Importing using lakectl cli

The lakectl CLI supports the import and ingest commands to import objects from an external source.

  • The import command acts the same as the UI import wizard. It imports (zero-copy) and commits the changes on the _<branch_name>_imported branch, with an optional flag to also merge the changes into <branch_name>.
  • The ingest command lists the source bucket (with an optional prefix) from the client and creates pointers to the returned objects in lakeFS. The objects are staged (uncommitted) on the branch.

Using the lakectl import command

Usage
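# AWS S3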
lakectl import \
  --from s3://bucket/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/
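
# Azure Blob Storage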
lakectl import \
   --from https://storageAccountName.blob.core.windows.net/container/optional/prefix/ \
   --to lakefs://my-repo/my-branch/optional/path/
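
# Google Cloud Storage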
lakectl import \
   --from gs://bucket/optional/prefix/ \
   --to lakefs://my-repo/my-branch/optional/path/

The imported objects will be committed to the _my-branch_imported branch. If the branch does not exist, it will be created. The --merge flag will merge _my-branch_imported into my-branch after a successful import.
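For example, to import and merge in a single command:

lakectl import \
  --from s3://bucket/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/ \
  --merge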

Using the lakectl ingest command

Prerequisites
  1. The user calling lakectl ingest has permissions to list the objects at the source object store.
  2. Recommended: The lakeFS installation has read permissions to the objects being ingested (to support downloading them directly from the lakeFS server).
  3. The source path is not a storage namespace used by lakeFS. For example, if lakefs://my-repo was created with storage namespace s3://my-bucket, then s3://my-bucket/* cannot be an ingestion source.

Usage
lakectl ingest \
  --from s3://bucket/optional/prefix/ \
  --to lakefs://my-repo/ingest-branch/optional/path/

The lakectl ingest command will attempt to use the current user's existing credentials, and it respects instance profiles, environment variables, and credential files (similarly to the AWS CLI). To ingest from other S3-compatible storage solutions, specify an endpoint, e.g., add --s3-endpoint-url https://play.min.io.
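For example, ingesting from the MinIO playground endpoint mentioned above:

lakectl ingest \
  --from s3://bucket/optional/prefix/ \
  --to lakefs://my-repo/ingest-branch/optional/path/ \
  --s3-endpoint-url https://play.min.io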

export AZURE_STORAGE_ACCOUNT="storageAccountName"
export AZURE_STORAGE_ACCESS_KEY="EXAMPLEroozoo2gaec9fooTieWah6Oshai5Sheofievohthapob0aidee5Shaekahw7loo1aishoonuuquahr3=="
lakectl ingest \
   --from https://storageAccountName.blob.core.windows.net/container/optional/prefix/ \
   --to lakefs://my-repo/ingest-branch/optional/path/

The lakectl ingest command currently supports storage accounts configured through environment variables as shown above.

Note: Currently, lakectl import supports the http:// and https:// schemes for Azure storage URIs; wasb, abfs, and adls are not supported.

export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.gcs_credentials.json"  # Optional, will fallback to the default configured credentials
lakectl ingest \
   --from gs://bucket/optional/prefix/ \
   --to lakefs://my-repo/ingest-branch/optional/path/

The lakectl ingest command currently supports the standard GOOGLE_APPLICATION_CREDENTIALS environment variable as described in Google Cloud’s documentation.

Limitations to importing data

Importing is only possible from the object storage service in which your installation stores its data. For example, if lakeFS is configured to use S3, you cannot import data from Azure.

Although created by lakeFS, import branches behave like any other branch: Authorization policies, CI/CD triggering, branch protection rules and all other lakeFS concepts apply to them as well.
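For instance, a branch protection rule can cover import branches just like any other branch. A hedged sketch with lakectl, assuming the default import-branch naming (check lakectl branch-protect --help for the exact pattern syntax):

lakectl branch-protect add lakefs://my-repo '_*_imported'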

Working with imported data

Note that lakeFS cannot manage your metadata if you make changes to data in the original bucket. The following table describes the results of making changes in the original bucket without importing them to lakeFS:

Object action in the original bucket | ListObjects result in lakeFS           | GetObject result in lakeFS
Create                               | Object not visible                     | Object not accessible
Overwrite                            | Object visible with outdated metadata  | Updated object accessible
Delete                               | Object visible                         | Object not accessible

AWS S3: Importing from public buckets

lakeFS needs access to the imported location to first list the files to import and later read the files upon user request.

In some use cases, you may want to import from a location that isn't owned by the account running lakeFS, for example, to import public datasets to experiment with lakeFS and Spark.

lakeFS requires additional permissions to read from public buckets. For example, for S3 public buckets, the following policy needs to be attached to the lakeFS S3 service account to allow access to public buckets, while not granting additional access to buckets owned by your account:

   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "PubliclyAccessibleBuckets",
         "Effect": "Allow",
         "Action": [
            "s3:GetBucketVersioning",
            "s3:ListBucket",
            "s3:GetBucketLocation",
            "s3:ListBucketMultipartUploads",
            "s3:ListBucketVersions",
            "s3:GetObject",
            "s3:GetObjectVersion",
            "s3:AbortMultipartUpload",
            "s3:ListMultipartUploadParts"
         ],
         "Resource": ["*"],
         "Condition": {
           "StringNotEquals": {
             "s3:ResourceAccount": "<YourAccountID>"
           }
         }
       }
     ]
   }
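
One way to attach this policy, assuming the lakeFS server runs under an IAM role and the policy above is saved to a local file (role name and file name below are placeholders):

aws iam put-role-policy \
  --role-name lakefs-service-role \
  --policy-name PubliclyAccessibleBuckets \
  --policy-document file://public-buckets-policy.json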

Copying data into a lakeFS repository

Another way of getting existing data into a lakeFS repository is by copying it. This has the advantage that the objects and their metadata are managed by the lakeFS installation, including lifecycle rules, immutability guarantees, and consistent listing. However, make sure to account for the storage cost and the time it takes to copy.

To copy data into lakeFS you can use the following tools:

  1. The lakectl command line tool - see the reference to learn more about using it to copy local data into lakeFS. Using lakectl fs upload --recursive you can upload multiple objects together from a given directory (see the example after this list).
  2. Using rclone
  3. Using Hadoop’s DistCp
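
For example, a sketch of the first two options (local paths, repository, and branch names are placeholders; the rclone example assumes a remote named lakefs configured against the lakeFS S3 gateway):

# Upload a local directory with lakectl
lakectl fs upload --recursive --source ./local-data/ lakefs://my-repo/main/datasets/

# Copy the same directory with rclone
rclone copy ./local-data/ lakefs:my-repo/main/datasets/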