Importing data into lakeFS
This section describes how to import existing data into a lakeFS repository without copying it. If you are interested in copying data into lakeFS, see Copying data to/from lakeFS.
Prerequisites
- Importing is permitted for users in the Supers (open-source) group or the SuperUsers (Cloud/Enterprise) group. To learn how lakeFS Cloud and lakeFS Enterprise users can fine-tune import permissions, see Fine-grained permissions below.
- The lakeFS server must have permissions to list the objects in the source bucket (see the example policy sketch after this list).
- The source bucket must be on the same cloud provider and in the same region as your repository.
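For example, when importing from AWS S3, the listing prerequisite above typically translates into list and read permissions on the source bucket for the role or service account lakeFS runs with. A minimal sketch; the bucket name is a placeholder, not taken from this guide:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LakeFSImportSourceRead",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::example-source-bucket",
        "arn:aws:s3:::example-source-bucket/*"
      ]
    }
  ]
}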
Using the lakeFS UI
- On your repository’s main page, click the Import button to open the import dialog.
- Under Import from, fill in the location in your object store that you would like to import from.
- Fill in the import destination in lakeFS. This should be a path under the current branch.
- Add a commit message, and optionally commit metadata.
- Press Import.
Once the import is complete, a new commit containing the imported objects will be created in the destination branch.
Using the CLI: lakectl import
The lakectl import command acts the same as the UI import wizard. It commits the changes to the selected branch.
AWS S3:
lakectl import \
  --from s3://bucket/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/

Azure Blob Storage:
lakectl import \
  --from https://storageAccountName.blob.core.windows.net/container/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/

Google Cloud Storage:
lakectl import \
  --from gs://bucket/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/
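The same import can also be triggered programmatically through the HTTP API. The following is a minimal sketch using curl; the endpoint path, the paths array, and the commit object are assumptions based on the lakeFS API reference, so verify the exact request schema against your server version. Authentication uses your lakeFS access key and secret key, and the server address is a placeholder:

curl -u "$LAKEFS_ACCESS_KEY_ID:$LAKEFS_SECRET_ACCESS_KEY" \
  -X POST "https://<your-lakefs-server>/api/v1/repositories/my-repo/branches/my-branch/import" \
  -H "Content-Type: application/json" \
  -d '{
        "paths": [
          {
            "path": "s3://bucket/optional/prefix/",
            "destination": "optional/path/",
            "type": "common_prefix"
          }
        ],
        "commit": {"message": "Import objects from s3://bucket/optional/prefix/"}
      }'

The import runs asynchronously; the response carries an import ID that can be used to poll for status or to cancel the import, which is why fs:ImportCancel appears in the permission list under Fine-grained permissions below.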
Notes
- Any previously existing objects under the destination prefix will be deleted.
- The import duration depends on the number of imported objects; the import rate is roughly a few thousand objects per second.
- For security reasons, if you are using lakeFS on top of your local disk (blockstore.type=local), you need to enable the import feature explicitly. To do so, set blockstore.local.import_enabled to true and specify the allowed import paths in blockstore.local.allowed_external_prefixes (see the configuration reference, and the configuration sketch after this list). When using lakectl or the lakeFS UI, you can currently import only directories locally. If you need to import a single file, use the HTTP API or API Clients with type=object in the request body and destination=<full-path-to-file>.
- Making changes to data in the original bucket will not be reflected in lakeFS, and may cause inconsistencies.
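For the local-blockstore case above, the relevant part of the lakeFS configuration file might look like the following sketch. The path values are placeholders; only import_enabled and allowed_external_prefixes are the settings discussed here (see the configuration reference for the full blockstore section):

blockstore:
  type: local
  local:
    path: "~/lakefs/data"
    import_enabled: true
    allowed_external_prefixes:
      - "/mnt/datasets"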
Examples
To explore practical examples and real-world use cases of importing data into lakeFS, we recommend checking out our comprehensive blog post on the subject.
Fine-grained permissions
lakeFS Cloud and lakeFS Enterprise
With RBAC support, the lakeFS user running the import command should have the following permissions in lakeFS: fs:WriteObject, fs:CreateMetaRange, fs:CreateCommit, fs:ImportFromStorage, and fs:ImportCancel.
As mentioned above, all of these permissions are granted by default to the Supers (open-source) group and the SuperUsers (Cloud/Enterprise) group.
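To grant these permissions to a dedicated group instead of relying on the built-in groups, an RBAC policy along the following lines could be attached to it. This is a minimal sketch: the policy name is a placeholder, and the resource can be scoped more tightly than "*" if needed:

{
  "id": "ImportersPolicy",
  "statement": [
    {
      "action": [
        "fs:WriteObject",
        "fs:CreateMetaRange",
        "fs:CreateCommit",
        "fs:ImportFromStorage",
        "fs:ImportCancel"
      ],
      "effect": "allow",
      "resource": "*"
    }
  ]
}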
Provider-specific permissions
In addition, the following provider-specific permissions may be required:
AWS S3: Importing from public buckets
lakeFS needs access to the imported location, first to list the files to import and later to read them on user request.
In some use cases you may want to import from a location that isn’t owned by the account running lakeFS, for example when importing public datasets to experiment with lakeFS and Spark.
lakeFS requires additional permissions to read from public buckets. For S3, the following policy can be attached to the lakeFS S3 service account to allow access to public buckets; the StringNotEquals condition on s3:ResourceAccount restricts this extra allowance to buckets outside your own account, so it does not broaden access to buckets you own:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PubliclyAccessibleBuckets",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketVersioning",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["*"],
      "Condition": {
        "StringNotEquals": {
          "s3:ResourceAccount": "<YourAccountID>"
        }
      }
    }
  ]
}
Azure
Note: The use of the adls hint for ADLS Gen2 storage accounts is deprecated; please use the original source URL for import.
See Azure deployment for limitations when using account credentials.
Google Cloud Storage
No specific prerequisites.