Using lakeFS with Databricks
Databricks is an Apache Spark-based analytics platform.
Configuration
For Databricks to work with lakeFS, set the S3 Hadoop configuration to the lakeFS endpoint and credentials:
- In Databricks, go to your cluster configuration page.
- Click Edit.
- Expand Advanced Options.
- Under the Spark tab, add the following configurations, replacing <repo-name> with your lakeFS repository name. Also replace the credentials and endpoint with those of your lakeFS installation.
spark.hadoop.fs.s3a.bucket.<repo-name>.access.key AKIAIOSFODNN7EXAMPLE
spark.hadoop.fs.s3a.bucket.<repo-name>.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.s3a.bucket.<repo-name>.endpoint https://lakefs.example.com
spark.hadoop.fs.s3a.path.style.access true
When using Delta Lake tables, some versions of Databricks also require the following:
spark.hadoop.fs.s3a.bucket.<repo-name>.aws.credentials.provider shaded.databricks.org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.hadoop.fs.s3a.bucket.<repo-name>.session.token lakefs
For more information, see the Databricks documentation.
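If you want to experiment from a notebook before editing the cluster configuration, the same keys can also be set on the Hadoop configuration at runtime. The following is only a sketch using the placeholder values from above (note that the spark.hadoop. prefix is dropped); cluster-level configuration remains the recommended approach, since runtime changes may not affect an S3A filesystem that has already been initialized.
// Sketch: apply the lakeFS S3A settings from a notebook, using placeholder values.
val repoName = "example-repo" // replace with your lakeFS repository name
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set(s"fs.s3a.bucket.${repoName}.access.key", "AKIAIOSFODNN7EXAMPLE")
hadoopConf.set(s"fs.s3a.bucket.${repoName}.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
hadoopConf.set(s"fs.s3a.bucket.${repoName}.endpoint", "https://lakefs.example.com")
hadoopConf.set("fs.s3a.path.style.access", "true")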
When running lakeFS inside your VPC
When lakeFS runs inside your private network, your Databricks cluster needs to be able to access it. This can be done by setting up VPC peering between the two VPCs: the one where lakeFS runs and the one where Databricks runs. For this to work with Delta Lake tables, you also have to disable multi-cluster writes with:
spark.databricks.delta.multiClusterWrites.enabled false
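To apply this setting for a single notebook session instead of at the cluster level, it can also be set at runtime. A minimal sketch, assuming it runs before any Delta writes in the session:
// Disable Delta Lake multi-cluster writes for the current Spark session.
spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")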
Using multi-cluster writes
When using multi-cluster writes, Databricks overrides Delta's s3-commit action. The overriding action tries to contact lakeFS from servers in Databricks' own AWS account, which cannot access your private network. So, if you must use multi-cluster writes, you will have to allow access from Databricks' AWS account to lakeFS. We are researching the best ways to achieve this and will update here soon.
Reading Data
To access objects in lakeFS, use the lakeFS path convention: s3a://[REPOSITORY]/[BRANCH]/PATH/TO/OBJECT
Here is an example of reading a Parquet file from lakeFS into a Spark DataFrame:
val repo = "example-repo"
val branch = "main"
val dataPath = s"s3a://${repo}/${branch}/example-path/example-file.parquet"
val df = spark.read.parquet(dataPath)
You can now use this DataFrame as you normally would.
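For example, a quick sanity check that makes no assumptions about the file's schema:
// Inspect the schema and row count of the DataFrame read from lakeFS.
df.printSchema()
println(s"rows: ${df.count()}")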
Writing Data
Now simply write your results back to a lakeFS path:
df.write
.partitionBy("example-column")
.parquet(s"s3a://${repo}/${branch}/output-path/")
The data is written to lakeFS as new, uncommitted changes on your branch. You can now commit these changes or revert them.
Case Study: SimilarWeb
See how SimilarWeb integrated lakeFS with Databricks.