Using lakeFS with Delta Lake
Delta Lake is an open-source storage framework designed to improve performance and provide transactional guarantees to data lake tables.
Because lakeFS is format-agnostic, you can save data in Delta format within a lakeFS repository and benefit from the advantages of both technologies. Specifically:
- ACID operations can span across many Delta tables.
- CI/CD hooks can validate Delta table contents, schema, or even referential integrity.
- lakeFS supports zero-copy branching for quick experimentation with full isolation.
Using Delta Lake with lakeFS from Apache Spark
Given the native integration between Delta Lake and Spark, it’s most common that you’ll interact with Delta tables in a Spark environment.
To configure a Spark environment to read from and write to Delta tables within a lakeFS repository, set your lakeFS credentials and S3 Gateway endpoint in the S3A Hadoop configuration, just as you would for any S3-compatible object store.
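For example, a minimal PySpark setup might look like the following sketch. The endpoint, access key, and secret key values are placeholders for your own lakeFS installation, and the Delta Lake Spark package itself is assumed to be configured separately:
```python
from pyspark.sql import SparkSession

# Point the S3A filesystem at the lakeFS S3 Gateway and authenticate with
# lakeFS access keys (placeholder values -- replace with your own).
spark = (
    SparkSession.builder.appName("delta-on-lakefs")
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "AKIAlakefsEXAMPLE")
    .config("spark.hadoop.fs.s3a.secret.key", "lakefs-secret-key-EXAMPLE")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```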
Once set, you can interact with Delta tables using regular Spark path URIs. Make sure that you include the lakeFS repository and branch name:
```python
df.write.format("delta").save("s3a://<repo-name>/<branch-name>/path/to/delta-table")
```
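Reading the table back uses the same path convention. As a quick sketch (the path is illustrative):
```python
# Load the Delta table from the same repository and branch (illustrative path).
df = spark.read.format("delta").load("s3a://<repo-name>/<branch-name>/path/to/delta-table")
df.show()
```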
Note: If using the Databricks Analytics Platform, see the integration guide for configuring a Databricks cluster to use lakeFS.
To see the integration in action, see this notebook in the lakeFS Samples Repository.
Using Delta Lake with lakeFS from Python
The delta-rs library provides Delta Lake bindings for Python, so you can use Delta Lake with lakeFS directly from Python without needing Spark. The integration works through the lakeFS S3 Gateway.
The documentation for the deltalake Python module details how to read, write, and query Delta Lake tables. To use it with lakeFS, use an s3a path for the table based on your repository and branch (for example, s3a://delta-lake-demo/main/my_table/) and specify the following storage_options:
```python
storage_options = {
    "AWS_ENDPOINT": "<your lakeFS endpoint>",
    "AWS_ACCESS_KEY_ID": "<your lakeFS access key>",
    "AWS_SECRET_ACCESS_KEY": "<your lakeFS secret key>",
    "AWS_REGION": "us-east-1",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}
```
If your lakeFS installation is not using HTTPS (for example, you’re just running it locally), then also add the option "AWS_STORAGE_ALLOW_HTTP": "true".
To see the integration in action, see this notebook in the lakeFS Samples Repository.
Viewing Delta Lake table changes in lakeFS (Beta)
Using lakeFS, you can:
- Compare different versions of Delta Lake tables.
- Get a detailed view of all Delta Lake table operations performed since the tables diverged.
For example, comparing branches dev and main, we can see that the movies table has changed on dev since the branches diverged. Expanding the delete operation, we learn that all movies with a rating < 4 were deleted from the table on the dev branch.
Note: The diff is available as long as the Delta table history is retained (30 days by default). A Delta Lake table's history is derived from the Delta log JSON files.
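The diff itself is surfaced in the lakeFS UI, but as a rough illustration of what it compares, you can manually inspect the Delta log history of the same table on two branches with the deltalake package. This is only a sketch of the underlying idea, not the plugin's implementation; the repository, branches, table path, and storage_options values are placeholders:
```python
from deltalake import DeltaTable

# Placeholder lakeFS connection details (see the Python section above).
storage_options = {
    "AWS_ENDPOINT": "https://lakefs.example.com",
    "AWS_ACCESS_KEY_ID": "AKIAlakefsEXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "lakefs-secret-key-EXAMPLE",
    "AWS_REGION": "us-east-1",
}

# Inspect the operations recorded in the Delta log on each branch.
for branch in ("main", "dev"):
    dt = DeltaTable(f"s3a://delta-lake-demo/{branch}/movies/", storage_options=storage_options)
    print(branch, [entry["operation"] for entry in dt.history()])
```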
Installing the Delta Lake diff plugin
To enable the Delta Lake diff feature, you need to install a plugin on the lakeFS server. You will find the plugin binary in the release tarball (versions >= 0.97.3). Rename the delta_diff binary to delta and place it under ~/.lakefs/plugins/diff on the machine where lakeFS is running.
You can customize the location of the Delta Lake diff plugin by changing the diff.delta.plugin and plugin.properties.<plugin name>.path configurations in the .lakefs.yaml file.
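As an illustration only, assuming the configuration keys nest exactly as named above, a custom plugin location in .lakefs.yaml might look like this sketch (the path is a placeholder):
```yaml
# Illustrative sketch -- adjust key nesting and path to your installation.
diff:
  delta:
    plugin: delta
plugin:
  properties:
    delta:
      path: /opt/lakefs/plugins/diff/delta
```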
Note: If you’re using the lakeFS Docker image, the plugin is installed by default.
Best Practices
Production workflows should ideally write to a single lakeFS branch that can then be safely merged into main. This is because the Delta log is an auto-generated sequence of text files used to keep track of transactions on a Delta table sequentially. Writing to one Delta table from multiple lakeFS branches is possible, but note that it will result in conflicts if you later attempt to merge one branch into the other.
When running lakeFS inside your VPC (on AWS)
When lakeFS runs inside your private network, your Databricks cluster needs to be able to access it. This can be done by setting up VPC peering between the two VPCs (the one where lakeFS runs and the one where Databricks runs). For this to work on Delta Lake tables, you would also have to disable multi-cluster writes with:
```
spark.databricks.delta.multiClusterWrites.enabled false
```
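Depending on how your clusters are managed, it may also be possible to set this per notebook session rather than in the cluster's Spark configuration. The following is a hedged sketch using the standard Spark conf API; whether this particular setting can be changed at session level depends on your environment:
```python
# Disable Delta multi-cluster writes for this Spark session (may instead need
# to be set in the cluster's Spark configuration, depending on your setup).
spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")
```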
Using multi-cluster writes (on AWS)
When using multi-cluster writes, Databricks overrides Delta’s S3-commit action. The new action tries to contact lakeFS from servers on Databricks’ own AWS account, which of course won’t be able to access your private network. So, if you must use multi-cluster writes, you’ll have to allow access from Databricks’ AWS account to lakeFS. If you are trying to achieve that, please reach out on Slack and the community will try to assist.
Further Reading
See the Guaranteeing Consistency in Your Delta Lake Tables With lakeFS post on the lakeFS blog to learn how to guarantee data quality in a Delta table by using lakeFS branches.