Garbage Collection: committed objects
By default, lakeFS keeps all your objects forever. This allows you to travel back in time to previous versions of your data. However, sometimes you may want to hard-delete your objects - namely, delete them from the underlying storage. Reasons for this include cost-reduction and privacy policies.
Garbage collection rules in lakeFS define how long objects are retained after they have been deleted. lakeFS provides a Spark program to hard-delete objects whose retention period has ended according to the GC rules.
This program does not remove any commits: you will still be able to use commits containing hard-deleted objects, but trying to read these objects from lakeFS will result in a 410 Gone HTTP status.
lakeFS Cloud users enjoy a managed Garbage Collection service, and do not need to run this Spark program.
Understanding Garbage Collection
For every branch, the GC job retains deleted objects for the number of days defined for the branch. In the absence of a branch-specific rule, the default rule for the repository is used. If an object is present in more than one branch ancestry, it’s retained according to the rule with the largest number of days between those branches. That is, it’s hard-deleted only after the retention period has ended for all relevant branches.
Example GC rules for a repository:
{
  "default_retention_days": 14,
  "branches": [
    {"branch_id": "main", "retention_days": 21},
    {"branch_id": "dev", "retention_days": 7}
  ]
}
In the above example, objects are retained for 14 days after deletion by default. However, if they are present in the main branch, they are retained for 21 days. Objects present in the dev branch (but not in any other branch) are retained for 7 days after they are deleted.
Configuring GC rules
To define garbage collection rules, either use the lakectl command or the lakeFS web UI:
Create a JSON file with your GC rules:
cat <<EOT >> example_repo_gc_rules.json
{
  "default_retention_days": 14,
  "branches": [
    {"branch_id": "main", "retention_days": 21},
    {"branch_id": "dev", "retention_days": 7}
  ]
}
EOT
Set the GC rules using lakectl:
lakectl gc set-config lakefs://example-repo -f example_repo_gc_rules.json
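To verify the rules were stored, you can read them back with lakectl (assuming your lakectl version includes the gc get-config subcommand):
lakectl gc get-config lakefs://example-repo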
From the lakeFS web UI:
- Navigate to the main page of your repository.
- Go to Settings -> Retention.
- Click Edit policy and paste your GC rules into the text box as JSON.
- Save your changes.
Running the GC job
To run the job, use one of the spark-submit commands below (or your preferred method of running Spark programs). The job will hard-delete objects that were deleted and whose retention period has ended according to the GC rules. On AWS S3, choose the command matching your lakeFS Spark client build:
For the client built for Spark 3.1.2 with Hadoop 3:
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:2.7.7 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
-c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.8.1/lakefs-spark-client-312-hadoop3-assembly-0.8.1.jar \
example-repo us-east-1
For the client built for Spark 3.0.1:
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:2.7.7 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
-c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-301/0.8.1/lakefs-spark-client-301-assembly-0.8.1.jar \
example-repo us-east-1
For the client built for Spark 2.4.7:
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:2.7.7 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
-c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-247/0.8.1/lakefs-spark-client-247-assembly-0.8.1.jar \
example-repo us-east-1
On Azure, if you want to access your storage using the storage account key:
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:3.2.1 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.azure.account.key.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=<AZURE_STORAGE_ACCESS_KEY> \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.8.1/lakefs-spark-client-312-hadoop3-assembly-0.8.1.jar \
example-repo
Or, if you want to access your storage using an Azure service principal:
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:3.2.1 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.azure.account.auth.type.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=OAuth \
-c spark.hadoop.fs.azure.account.oauth.provider.type.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider \
-c spark.hadoop.fs.azure.account.oauth2.client.id.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=<application-id> \
-c spark.hadoop.fs.azure.account.oauth2.client.secret.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=<service-credential-key> \
-c spark.hadoop.fs.azure.account.oauth2.client.endpoint.<AZURE_STORAGE_ACCOUNT>.dfs.core.windows.net=https://login.microsoftonline.com/<directory-id>/oauth2/token \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.8.1/lakefs-spark-client-312-hadoop3-assembly-0.8.1.jar \
example-repo
Notes:
- On Azure, GC was tested only on Spark 3.3.0, but may work with other Spark and Hadoop versions.
- If the hadoop-azure package is not part of your environment, add it to your spark-submit command with --packages org.apache.hadoop:hadoop-azure:3.2.1
- For GC to work on Azure Blob Storage, soft delete should be disabled.
⚠️ On GCP, only the “mark” phase of Garbage Collection is currently supported: the program will output a list of expired objects, and you will have to delete them manually. We have concrete plans to extend this support to actually delete the objects. To run the mark phase:
spark-submit --class io.treeverse.clients.GarbageCollector \
--jars https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.google.cloud.auth.service.account.enable=true \
-c spark.hadoop.google.cloud.auth.service.account.json.keyfile=<PATH_TO_JSON_KEYFILE> \
-c spark.hadoop.fs.gs.project.id=<GCP_PROJECT_ID> \
-c spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
-c spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
-c spark.hadoop.lakefs.gc.do_sweep=false \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.8.1/lakefs-spark-client-312-hadoop3-assembly-0.8.1.jar \
example-repo
This program will not delete anything. Instead, it will find all the objects that are safe to delete and save a list containing all their keys, in Parquet format. The list will then be found under the path:
gs://<STORAGE_NAMESPACE>/_lakefs/logs/gc/expired_addresses/
Note that this is a path in your Google Storage bucket, and not in your lakeFS repository. For example, if your repository’s underlying storage is gs://example-bucket/example-path, you will find the list in:
gs://example-bucket/example-path/_lakefs/logs/gc/expired_addresses/dt=<TIMESTAMP>/
You can now delete the objects appearing in the list from your Google Storage bucket.
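For example, a minimal sketch of listing the generated runs with gsutil (assuming the Google Cloud SDK is installed and has access to the bucket):
# Each dt=<TIMESTAMP> prefix contains Parquet files listing expired object keys
gsutil ls "gs://example-bucket/example-path/_lakefs/logs/gc/expired_addresses/"
You can then read the Parquet files with your tool of choice and delete the listed keys from the bucket.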
You will find the list of objects hard-deleted by the job in the storage namespace of the repository. It is saved in Parquet format under _lakefs/logs/gc/deleted_objects.
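For example, on S3 you could locate these reports with the AWS CLI (a sketch, assuming the CLI is configured and <STORAGE_NAMESPACE> is your repository's storage namespace):
# List GC runs that produced a deleted-objects report
aws s3 ls "s3://<STORAGE_NAMESPACE>/_lakefs/logs/gc/deleted_objects/"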
GC job options
By default, GC first creates a list of expired objects according to your retention rules and then hard-deletes those objects. However, you can use GC options to break the GC job down into two stages:
- Mark stage: GC will mark the expired objects to hard-delete, without deleting them.
- Sweep stage: GC will hard-delete objects marked by a previous mark-only GC run.
By breaking GC into these stages, you can pause between them to create a backup of the objects that GC is about to sweep, and later restore them if needed. You can use the GC backup and restore utility to do that.
Mark only mode
To make GC run the mark stage only, add the following properties to your spark-submit command:
spark.hadoop.lakefs.gc.do_sweep=false
spark.hadoop.lakefs.gc.mark_id=<MARK_ID> # Replace <MARK_ID> with your own identification string. This MARK_ID will enable you to start a sweep (actual deletion) run later
Running in mark only mode, GC will write the addresses of the expired objects to be deleted, in Parquet format, to the following location: STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=<MARK_ID>/
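For example, on AWS a mark-only run could look like the following sketch (based on the Spark 3.1.2 command above; <MARK_ID> is any identifier you choose):
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:2.7.7 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
-c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
-c spark.hadoop.lakefs.gc.do_sweep=false \
-c spark.hadoop.lakefs.gc.mark_id=<MARK_ID> \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.8.1/lakefs-spark-client-312-hadoop3-assembly-0.8.1.jar \
example-repo us-east-1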
Notes:
- Mark only mode is only available from v0.4.0 of the lakeFS Spark client.
- The spark.hadoop.lakefs.debug.gc.no_delete property has been deprecated as of v0.4.0.
Sweep only mode
To make GC run the sweep stage only, add the following properties to your spark-submit command:
spark.hadoop.lakefs.gc.do_mark=false
spark.hadoop.lakefs.gc.mark_id=<MARK_ID> # Replace <MARK_ID> with the identifier you used on a previous mark-only run
Running in sweep only mode, GC will hard-delete the expired objects marked by a previous mark-only run and listed in: STORAGE_NAMESPACE/_lakefs/retention/gc/addresses/mark_id=<MARK_ID>/.
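For example, the matching sweep-only run on AWS could look like this sketch (again based on the Spark 3.1.2 command above, with <MARK_ID> taken from the earlier mark-only run):
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:2.7.7 \
-c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
-c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
-c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
-c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
-c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
-c spark.hadoop.lakefs.gc.do_mark=false \
-c spark.hadoop.lakefs.gc.mark_id=<MARK_ID> \
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-312-hadoop3/0.8.1/lakefs-spark-client-312-hadoop3-assembly-0.8.1.jar \
example-repo us-east-1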
Note: Sweep only mode is only available from v0.4.0 of the lakeFS Spark client.
Considerations
- In order for an object to be hard-deleted, it must be deleted from all branches. You should remove stale branches to prevent them from retaining old objects. For example, consider a branch that has been merged to main and has become stale. An object which is later deleted from main will always be present in the stale branch, preventing it from being hard-deleted.
- lakeFS will never delete objects outside your repository’s storage namespace. In particular, objects that were imported using lakectl ingest or the UI import wizard will not be affected by GC jobs.
- In cases where deleted objects are brought back to life while a GC job is running, said objects may or may not be deleted. Such actions include:
- Reverting a commit in which a file was deleted.
- Branching out from an old commit.
- Expanding the retention period of a branch.
- Creating a branch from an existing branch, where the new branch has a longer retention period.
Backup and restore
GC was created to hard-delete objects from your underlying objects store according to your retention rules. However, when you start using the feature you may want to first gain confidence in the decisions GC makes. The GC backup and restore utility helps you do that.
Use-cases:
- Backup: copy expired objects from your repository’s storage namespace to an external location before running GC in sweep only mode.
- Restore: copy objects that were hard-deleted by GC from an external location you used for saving your backup into your repository’s storage namespace.
Follow the rclone documentation to configure remote access to the underlying storage used by lakeFS. In the commands below, replace <LAKEFS_STORAGE_NAMESPACE> with a remote:bucket/path that points to the lakeFS repository storage namespace, and <BACKUP_STORAGE_LOCATION> with a storage location outside your lakeFS storage namespace into which you want to save the backup.
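For instance, an rclone remote named azure for an Azure Blob storage account could be defined in rclone.conf like this (a sketch; the account name and key are placeholders for your own values):
[azure]
type = azureblob
account = <AZURE_STORAGE_ACCOUNT>
key = <AZURE_STORAGE_ACCESS_KEY>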
Backup command
rclone --include "*.txt" cat "<LAKEFS_STORAGE_NAMESPACE>/_lakefs/retention/gc/addresses.text/mark_id=<MARK_ID>/" | \
rclone -P --no-traverse --files-from - copy <LAKEFS_STORAGE_NAMESPACE> <BACKUP_STORAGE_LOCATION>
Restore command
rclone --include "*.txt" cat "<LAKEFS_STORAGE_NAMESPACE>/_lakefs/retention/gc/addresses.text/mark_id=<MARK_ID>/" | \
rclone -P --no-traverse --files-from - copy <BACKUP_STORAGE_LOCATION> <LAKEFS_STORAGE_NAMESPACE>
Example
The following commands back up and restore objects using a configured rclone remote named azure (Azure Blob Storage) to access the example repository storage namespace https://lakefs.blob.core.windows.net/repo/example/:
# Backup
rclone --include "*.txt" cat "azure://repo/example/_lakefs/retention/gc/addresses.text/mark_id=a64d1885-6202-431f-a0a3-8832e4a5865a/" | \
rclone -P --no-traverse --files-from - copy azure://repo/example/ azure://backup/repo-example/
# Restore
rclone --include "*.txt" cat "azure://repo/example/_lakefs/retention/gc/addresses.text/mark_id=a64d1885-6202-431f-a0a3-8832e4a5865a/" | \
rclone -P --no-traverse --files-from - copy azure://backup/repo-example/ azure://repo/example/