Using lakeFS with Delta Lake

Delta Lake is an open table format designed to improve performance and provide transactional guarantees for data lake tables.

lakeFS is format-agnostic, so you can save data in Delta format within a lakeFS repository to reap the benefits of both technologies. Specifically:

  1. ACID operations can now span multiple Delta tables.
  2. CI/CD hooks can validate Delta table contents, schema, or even referential integrity.
  3. lakeFS supports zero-copy branching for quick experimentation with full isolation.

Table of contents

  1. Configuration
  2. Limitations
  3. Read more

Configuration

You will most commonly interact with Delta tables in a Spark environment, given the native integration between Delta Lake and Spark.

To configure a Spark environment to read from and write to a Delta table within a lakeFS repository, you need to set the proper credentials and endpoint in the S3 Hadoop configuration, like you’d do with any Spark script.

 sc.hadoopConfiguration.set("spark.hadoop.fs.s3a.path.style.access", "true")
 sc.hadoopConfiguration.set("spark.hadoop.fs.s3a.bucket.<repo-name>.access.key", "AKIAIOSFODNN7EXAMPLE")
 sc.hadoopConfiguration.set("spark.hadoop.fs.s3a.bucket.<repo-name>.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
 sc.hadoopConfiguration.set("spark.hadoop.fs.s3a.bucket.<repo-name>.endpoint", "https://lakefs.example.com")

Once set, you can interact with Delta tables using regular Spark path URIs. Make sure that you include the lakeFS repository and branch name:

data.write.format("delta").save("s3a://<repo-name>/<branch-name>/path/to/delta-table")

Note: If using the Databricks Analytics Platform, see the integration guide for configuring a Databricks cluster to use lakeFS.

Limitations

The Delta log is an auto-generated sequence of text files that records every transaction on a Delta table in order. Writing to the same Delta table from multiple lakeFS branches is possible, but it will produce conflicts in the table's _delta_log when you later try to merge one branch into the other. For this reason, production workflows should write to a single lakeFS branch that can then be safely merged into main.
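
As an illustration of how such a conflict arises (the branch names and DataFrames below are hypothetical), consider two branches created from main that each write to the same table:

 // Hypothetical: "experiment-a" and "experiment-b" were both branched from main,
 // so both see the Delta table at the same version.
 updatesA.write.format("delta").mode("append").save("s3a://<repo-name>/experiment-a/path/to/delta-table")
 updatesB.write.format("delta").mode("append").save("s3a://<repo-name>/experiment-b/path/to/delta-table")
 // Each branch now holds its own _delta_log/00000000000000000001.json with different contents,
 // so merging one branch into the other conflicts on that object.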

Read more

See this post on the lakeFS blog, which shows how to guarantee data quality in a Delta table using lakeFS branches.