Import data into lakeFS
The simplest way to bring data into lakeFS is by copying it, but this approach may not be suitable when a lot of data is involved. To avoid copying the data, lakeFS offers Zero-copy import. With this approach, lakeFS only creates pointers to your existing objects in your new repository.
Zero-copy import
Prerequisites
User Permissions
To run import you need the following permissions: fs:WriteObject, fs:CreateMetaRange, fs:CreateCommit, fs:ImportFromStorage, and fs:ImportCancel.
The first three permissions are available by default to users in the default Developers group (RBAC) or the Writers group (ACL). The Import* permissions enable the user to import data from any location of the storage provider that lakeFS has access to, and to cancel the operation if needed. They are therefore only available to users in the Supers group (ACL) or the SuperUsers group (RBAC). RBAC installations can modify policies to grant these permissions to any group, such as Developers; see the example policy below.
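For example, a policy along the following lines could be attached to the Developers group. This is a sketch only: the policy ID is illustrative, and the exact policy format should be verified against the lakeFS RBAC documentation for your version.

{
  "id": "AllowImportFromStorage",
  "statement": [
    {
      "action": [
        "fs:ImportFromStorage",
        "fs:ImportCancel"
      ],
      "effect": "allow",
      "resource": "*"
    }
  ]
}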
lakeFS Permissions
lakeFS must have permissions to list the objects in the source object store, and the source bucket must be in the same region as your destination bucket. In addition, see the following storage-provider-specific instructions:
AWS S3: Importing from public buckets
lakeFS needs access to the imported location, first to list the files to import and later to read the files upon user request.
In some cases you may want to import from a location that isn't owned by the account running lakeFS, for example to import public datasets and experiment with lakeFS and Spark.
lakeFS requires additional permissions to read from public buckets. For S3, the following policy needs to be attached to the lakeFS S3 service account to allow access to public buckets, while blocking access to other buckets owned by the account:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "PubliclyAccessibleBuckets",
"Effect": "Allow",
"Action": [
"s3:GetBucketVersioning",
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:ListBucketMultipartUploads",
"s3:ListBucketVersions",
"s3:GetObject",
"s3:GetObjectVersion",
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts"
],
"Resource": ["*"],
"Condition": {
"StringNotEquals": {
"s3:ResourceAccount": "<YourAccountID>"
}
}
}
]
}
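In this policy, <YourAccountID> is the ID of the AWS account that owns your own buckets; the condition limits the statement to buckets outside that account, so access is granted to public buckets without widening access to buckets you own.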
Azure: See Azure deployment for limitations when using account credentials.
Azure Data Lake Gen2
lakeFS requires a hint in the import source URL to understand that the provided storage account is ADLS Gen2.
For a source account URL such as:
https://<my-account>.core.windows.net/path/to/import/
Please add the *adls* subdomain to the URL as follows:
https://<my-account>.adls.core.windows.net/path/to/import/
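For instance, an import from ADLS Gen2 with lakectl might look as follows. This is a sketch; the repository and branch names are illustrative:

lakectl import \
  --from https://<my-account>.adls.core.windows.net/path/to/import/ \
  --to lakefs://my-repo/my-branch/optional/path/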
Google Cloud Storage: No specific prerequisites.
Using the lakeFS UI
To import using the UI, lakeFS must have permissions to list the objects in the source object store.
- In your repository's main page, click the Import button to open the import dialog.
- Under Import from, fill in the location on your object store you would like to import from.
- Fill in the import destination in lakeFS.
- Add a commit message, and optionally metadata.
- Press Import.
Once the import is complete, the changes are merged into the destination branch.
Notes
- Import uses the src-wins merge strategy; therefore, importing objects and prefixes that already exist in the destination will overwrite them.
- The import duration depends on the number of imported objects, but is roughly a few thousand objects per second.
lakectl import
Prerequisite: have lakectl installed.
The lakectl import command acts the same as the UI import wizard. It commits the changes to a dedicated branch, with an optional flag to merge the changes to <branch_name>.
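# Import from AWS S3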
lakectl import \
--from s3://bucket/optional/prefix/ \
--to lakefs://my-repo/my-branch/optional/path/
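
# Import from Azure Blob Storage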
lakectl import \
--from https://storageAccountName.blob.core.windows.net/container/optional/prefix/ \
--to lakefs://my-repo/my-branch/optional/path/
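
# Import from Google Cloud Storage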
lakectl import \
--from gs://bucket/optional/prefix/ \
--to lakefs://my-repo/my-branch/optional/path/
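To also merge the imported data into the destination branch as part of the same run, add the merge flag. The flag name is assumed here to be --merge; check lakectl import --help for your version:

lakectl import \
  --from s3://bucket/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/ \
  --merge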
Limitations
- Importing is only possible from the object storage service in which your installation stores its data. For example, if lakeFS is configured to use S3, you cannot import data from Azure.
- Import is available for S3, GCP, and Azure.
- For security reasons, if you are running lakeFS on top of your local disk, you need to enable the import feature explicitly. To do so, set blockstore.local.import_enabled to true and specify the allowed import paths in blockstore.local.allowed_external_prefixes (see the configuration reference and the sketch below). Since there are some differences between object stores and file systems in the way directories/prefixes are treated, local import is allowed only for directories.
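A minimal configuration sketch for enabling local import follows; the YAML layout and the example path are illustrative, so consult the configuration reference for the authoritative format:

blockstore:
  type: local
  local:
    import_enabled: true
    allowed_external_prefixes:
      - /mnt/imported-data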
Working with imported data
Note that lakeFS cannot manage your metadata if you make changes to data in the original bucket. The following table describes the results of making changes in the original bucket without importing them to lakeFS:
| Object action in the original bucket | ListObjects result in lakeFS | GetObject result in lakeFS |
|---|---|---|
| Create | Object not visible | Object not accessible |
| Overwrite | Object visible with outdated metadata | Updated object accessible |
| Delete | Object visible | Object not accessible |
Copying data into a lakeFS repository
Another way of getting existing data into a lakeFS repository is by copying it. This has the advantage of having the objects, along with their metadata, managed by the lakeFS installation, and of benefiting from lifecycle rules, immutability guarantees, and consistent listing. However, make sure to account for storage cost and copy time.
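For example, since lakeFS exposes an S3-compatible endpoint, a local dataset can be copied in with the AWS CLI. This is a sketch: the endpoint URL, repository, branch, and paths are illustrative, and the CLI must be configured with lakeFS credentials.

aws s3 cp ./local-dataset/ s3://my-repo/my-branch/datasets/local-dataset/ \
  --recursive \
  --endpoint-url https://lakefs.example.com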
To copy data into lakeFS you can use the following tools: