
Using lakeFS with Databricks

Databricks is an Apache Spark-based analytics platform.

Configuration

For Databricks to work with lakeFS, point the S3A Hadoop configuration at the lakeFS endpoint and credentials:

  1. In Databricks, go to your cluster configuration page.
  2. Click Edit.
  3. Expand Advanced Options.
  4. Under the Spark tab, add the following configuration, replacing <repo-name> with your lakeFS repository name. Also, replace the credentials and endpoint with those of your lakeFS installation.
spark.hadoop.fs.s3a.bucket.<repo-name>.access.key AKIAIOSFODNN7EXAMPLE
spark.hadoop.fs.s3a.bucket.<repo-name>.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.s3a.bucket.<repo-name>.endpoint https://lakefs.example.com
spark.hadoop.fs.s3a.path.style.access true
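
If editing the cluster configuration is not convenient, the same per-bucket S3A options can usually be applied from a notebook at runtime through the Hadoop configuration. The sketch below is an alternative under assumptions, not the official Databricks procedure: the repository name, credentials, and endpoint are placeholders, and some Databricks setups require these options to be set on the cluster rather than in a notebook.

// A minimal sketch: apply the same per-bucket S3A settings at runtime.
// The repository name, credentials, and endpoint below are placeholders.
val repoName = "example-repo"
val hadoopConf = spark.sparkContext.hadoopConfiguration

hadoopConf.set(s"fs.s3a.bucket.${repoName}.access.key", "AKIAIOSFODNN7EXAMPLE")
hadoopConf.set(s"fs.s3a.bucket.${repoName}.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
hadoopConf.set(s"fs.s3a.bucket.${repoName}.endpoint", "https://lakefs.example.com")
hadoopConf.set("fs.s3a.path.style.access", "true")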

When using Delta Lake tables, the following configuration may also be required on some versions of Databricks:

spark.hadoop.fs.s3a.bucket.<repo-name>.aws.credentials.provider shaded.databricks.org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.hadoop.fs.s3a.bucket.<repo-name>.session.token lakefs

For more information, see the Databricks documentation.

When running lakeFS inside your VPC

When lakeFS runs inside your private network, your Databricks cluster needs to be able to access it. This can be done by setting up VPC peering between the two VPCs (the one where lakeFS runs and the one where Databricks runs). For this to work with Delta Lake tables, you also need to disable multi-cluster writes with:

spark.databricks.delta.multiClusterWrites.enabled false

Using multi-cluster writes

When using multi-cluster writes, Databricks overrides Delta Lake's S3-commit action. The new action tries to contact lakeFS from servers in Databricks' own AWS account, which cannot access your private network. So, if you must use multi-cluster writes, you'll need to allow access from Databricks' AWS account to lakeFS. If you are trying to achieve that, please reach out on Slack and the community will try to assist.

Reading Data

To access objects in lakeFS, you'll need to use the lakeFS path convention: s3a://[REPOSITORY]/[BRANCH]/PATH/TO/OBJECT

Here is an example of reading a Parquet file from lakeFS into a Spark DataFrame:

val repo = "example-repo"
val branch = "main"
val dataPath = s"s3a://${repo}/${branch}/example-path/example-file.parquet"

val df = spark.read.parquet(dataPath)

You can now use this DataFrame as you normally would.
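
Because the branch is part of the path, reading the same object from another branch only requires changing the path. The snippet below is illustrative only and assumes a branch named "dev" exists in the repository:

// A sketch: read the same file from a hypothetical "dev" branch and compare.
val devBranch = "dev"
val devDf = spark.read.parquet(s"s3a://${repo}/${devBranch}/example-path/example-file.parquet")

println(s"rows on ${branch}: ${df.count()}, rows on ${devBranch}: ${devDf.count()}")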

Writing Data

Now simply write your results back to a lakeFS path:

df.write
  .partitionBy("example-column")
  .parquet(s"s3a://${repo}/${branch}/output-path/")

The data is now written to lakeFS as new changes on your branch. You can commit these changes or revert them.
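
Commits can be made from the lakeFS UI, with lakectl, or through the lakeFS API. As a rough sketch only, the example below posts a commit to the lakeFS REST API from Scala; the /api/v1 commits path and the basic-auth scheme are assumptions based on the lakeFS OpenAPI spec, the credentials are placeholders, and a JVM with java.net.http (Java 11 or later) is assumed. Verify the details against your lakeFS version.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Base64

// A minimal sketch: commit the staged changes through the lakeFS REST API.
// Endpoint, credentials, and names are placeholders; verify against your setup.
val lakefsEndpoint = "https://lakefs.example.com"
val accessKey = "AKIAIOSFODNN7EXAMPLE"
val secretKey = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

val auth = Base64.getEncoder.encodeToString(s"$accessKey:$secretKey".getBytes("UTF-8"))
val commitBody = """{"message": "Add example output data"}"""

val request = HttpRequest.newBuilder()
  .uri(URI.create(s"$lakefsEndpoint/api/v1/repositories/${repo}/branches/${branch}/commits"))
  .header("Authorization", s"Basic $auth")
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(commitBody))
  .build()

val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
println(s"Commit response: ${response.statusCode()} ${response.body()}")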

Case Study: SimilarWeb

See how SimilarWeb integrated lakeFS with Databricks.