
Garbage Collection

By default, lakeFS keeps all your objects forever. This allows you to travel back in time to previous versions of your data. However, sometimes you may want to hard-delete your objects, that is, delete them from the underlying storage. Common reasons include cost reduction and compliance with privacy policies.

Garbage collection rules in lakeFS define how long objects are retained after they have been deleted (see below for more information). lakeFS provides a Spark program to hard-delete objects that have been deleted and whose retention period has ended according to the GC rules. The GC job does not remove any commits: you will still be able to use commits containing hard-deleted objects, but trying to read these objects from lakeFS will result in a 410 Gone HTTP status.

Note: At this point, lakeFS supports Garbage Collection only on S3, but we have concrete plans to extend the support to Azure.

  1. Understanding Garbage Collection
    1. What gets collected
  2. Configuring GC rules
  3. Running the GC job
  4. Considerations

Understanding Garbage Collection

For every branch, the GC job retains deleted objects for the number of days defined for the branch. In the absence of a branch-specific rule, the default rule for the repository is used. If an object is present in the ancestry of more than one branch, it is retained according to the rule with the largest number of days among those branches. That is, it is hard-deleted only after the retention period has ended for all relevant branches.

Example GC rules for a repository:

{
  "default_retention_days": 14,
  "branches": [
    {"branch_id": "main", "retention_days": 21},
    {"branch_id": "dev", "retention_days": 7}
  ]
}

In the above example, objects are retained for 14 days after deletion by default. However, if they are present in the branch main, they are retained for 21 days. Objects present in the dev branch (but not in any other branch) are retained for 7 days after they are deleted. An object present in both main and dev is therefore hard-deleted only after the longer period, 21 days, has passed.

What gets collected

Because each object in lakeFS may be accessible from multiple branches, it might not be obvious what objects will be considered garbage and collected.

Garbage collection is configured by specifying, for each branch, the number of days to retain objects. If a branch is configured to retain objects for a given number of days, any object that was accessible from the branch's HEAD at any point during that many past days is retained.

The garbage collection process proceeds in two main phases:

  • Discover which commits will retain their objects. For every branch, the garbage collection job looks at the HEAD of the branch that many days ago; every commit at or since that HEAD must be retained.

    [Diagram: example commit history of the main and dev branches, with commits labeled by date]

    Continuing the example, branch main retains objects for 21 days and branch dev for 7. When running GC on 2022-03-31:

    • 7 days ago, on 2022-03-24, the head of branch dev was d: 2022-03-23. So that commit is retained (along with all more recent commits on dev), but all older commits d: * will be collected.
    • 21 days ago, on 2022-03-10, the head of branch main was 2022-03-09. So that commit is retained (along with all more recent commits on main), but commits 2022-02-27 and 2022-03-01 will be collected.
  • Discover which objects need to be garbage collected. Only objects accessible from some retained commit are kept; all other objects become candidates for deletion.

    In the example, all objects of commit 2022-03-12, for instance, are retained. This includes objects added in previous commits. However, objects added in commit d: 2022-03-14 that were overwritten or deleted in commit d: 2022-03-20 are not visible in any retained commit and will be garbage collected.

  • Garbage collect those objects by deleting them. The data of any deleted object will no longer be accessible. lakeFS retains all metadata about the object, but attempting to read it via the lakeFS API or the S3 gateway will return HTTP status 410 (“Gone”).
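
For example, attempting to read a hard-deleted object through lakectl would fail. A minimal sketch, with an illustrative repository, branch, and object path:

lakectl fs cat lakefs://example-repo/main/path/to/deleted-object
# The object's metadata still exists in lakeFS, but its data has been
# hard-deleted, so the read fails with HTTP status 410 ("Gone").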

Configuring GC rules

Using lakectl

Use the lakectl CLI to define the GC rules:

cat <<EOT > example_repo_gc_rules.json
{
  "default_retention_days": 14,
  "branches": [
    {"branch_id": "main", "retention_days": 21},
    {"branch_id": "dev", "retention_days": 7}
  ]
}
EOT

lakectl gc set-config lakefs://example-repo -f example_repo_gc_rules.json 
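
To verify that the rules were applied, you can read the configuration back; this assumes your lakectl version includes the gc get-config subcommand:

lakectl gc get-config lakefs://example-repo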

From the lakeFS UI

  1. Navigate to the main page of your repository.
  2. Go to Settings -> Retention.
  3. Click Edit policy and paste your GC rule into the text box as a JSON.
  4. Save your changes.

[Screenshot: GC rules settings in the lakeFS UI]

Running the GC job

The GC job is a Spark program that can be run using spark-submit (or using your preferred method of running Spark programs). The job will hard-delete objects that were deleted and whose retention period has ended according to the GC rules.

First, you’ll have to download the lakeFS Spark client Uber-jar. The Uber-jar is available at a public S3 location:

For Spark 2.4.7:
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-247/${CLIENT_VERSION}/lakefs-spark-client-247-assembly-${CLIENT_VERSION}.jar

For Spark 3.0.1:
http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-301/${CLIENT_VERSION}/lakefs-spark-client-301-assembly-${CLIENT_VERSION}.jar

CLIENT_VERSIONs for Spark 2.4.7 can be found here, and for Spark 3.0.1 they can be found here.
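
For example, assuming Spark 3.0.1 and a hypothetical client version 0.1.0, the Uber-jar could be fetched with curl using the pattern above:

CLIENT_VERSION=0.1.0  # illustrative; substitute a version from the lists above
curl -O "http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakefs-spark-client-301/${CLIENT_VERSION}/lakefs-spark-client-301-assembly-${CLIENT_VERSION}.jar"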

Second, replace <APPLICATION-JAR-PATH> with the path of the Uber-jar you downloaded, and run the following command to start the garbage collector:

spark-submit --class io.treeverse.clients.GarbageCollector \
  -c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1  \
  -c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
  -c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
  -c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
  -c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
  <APPLICATION-JAR-PATH> \
  example-repo us-east-1

Considerations

  1. In order for an object to be hard-deleted, it must be deleted from all branches. You should remove stale branches to prevent them from retaining old objects (see the lakectl sketch after this list). For example, consider a branch that has been merged to main and has since become stale: an object later deleted from main will still be present in the stale branch, preventing it from being hard-deleted.

  2. lakeFS will never delete objects outside your repository’s storage namespace. In particular, objects that were imported using lakefs import or lakectl ingest will not be affected by GC jobs.

  3. If deleted objects are brought back to life while a GC job is running, they may or may not be hard-deleted. Actions that can bring an object back to life include:

    1. Reverting a commit in which a file was deleted.
    2. Branching out from an old commit.
    3. Expanding the retention period of a branch.
    4. Creating a branch from an existing branch, where the new branch has a longer retention period.
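
As noted in the first consideration, deleting stale branches allows the objects they pin to be collected. A minimal lakectl sketch, with an illustrative branch name:

lakectl branch delete lakefs://example-repo/stale-feature-branch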