
Using lakeFS with Databricks

Overview

Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.

In this document, we will cover the various Databricks products and how they integrate with lakeFS.

Databricks Compute Options

Databricks offers several compute options for running workloads, and all of them can be used with lakeFS. At a basic level, Databricks compute products are Spark clusters that run on top of cloud infrastructure and offer different configuration options. From a lakeFS integration perspective, the main difference between them is how you configure the storage operations that read from and write to lakeFS.

lakeFS storage operations use either the lakeFS Hadoop Filesystem, which relies on the lakeFS OpenAPI, or the s3a filesystem, which uses the lakeFS S3 Gateway. In short, the lakeFS S3 Gateway is the fastest way to get started, but it routes all traffic through the lakeFS server. The lakeFS Hadoop Filesystem requires more setup, but data transfers go directly against the underlying bucket. The lakeFS Hadoop Filesystem can write to storage using the s3a filesystem or using pre-signed URLs generated by the lakeFS server. To read more about the alternatives, see the Spark integration page.
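
From a notebook, the two routes differ mainly in the URI scheme. A minimal sketch, assuming a hypothetical repository named example-repo with a branch named main, and that the relevant filesystem is already configured on the cluster (see the sections below):

```python
# lakeFS Hadoop Filesystem: metadata goes through the lakeFS API, while data
# is transferred directly against the underlying bucket (or via pre-signed URLs).
df = spark.read.parquet("lakefs://example-repo/main/tables/events/")

# lakeFS S3 Gateway: the repository acts as the bucket name in an s3a path,
# and all traffic is routed through the lakeFS server.
df = spark.read.parquet("s3a://example-repo/main/tables/events/")
```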

All-Purpose Compute

Provisioned compute used to analyze data in notebooks.

When you create a Databricks compute cluster, you can configure it to use lakeFS with the lakeFS Hadoop Filesystem (see Databricks installation guide) or the lakeFS S3 Gateway. The lakeFS S3 Gateway can be configured in the notebook or during cluster setup (Advanced Options -> Spark -> Spark config).
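
As a rough sketch, the cluster-level Spark config (Advanced Options -> Spark -> Spark config) could look like the following. The repository name example-repo, the endpoint, and the keys are placeholders; only one of the two groups is needed, and the comment lines are annotations rather than config entries. For the lakeFS Hadoop Filesystem, the lakeFS client JAR must also be installed on the cluster, and, unless pre-signed URLs are used, the cluster needs its own access to the underlying bucket (for example via an instance profile):

```
# Option 1: lakeFS Hadoop Filesystem
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.endpoint https://lakefs.example.com/api/v1
spark.hadoop.fs.lakefs.access.key AKIAlakefsEXAMPLE
spark.hadoop.fs.lakefs.secret.key <lakefs-secret-access-key>

# Option 2: lakeFS S3 Gateway (per-bucket s3a configuration)
spark.hadoop.fs.s3a.bucket.example-repo.endpoint https://lakefs.example.com
spark.hadoop.fs.s3a.bucket.example-repo.access.key AKIAlakefsEXAMPLE
spark.hadoop.fs.s3a.bucket.example-repo.secret.key <lakefs-secret-access-key>
spark.hadoop.fs.s3a.bucket.example-repo.path.style.access true
```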

Jobs Compute

Provisioned compute used to run automated jobs. The Databricks job scheduler automatically creates a job cluster whenever a job is configured to run on new compute.

To use lakeFS with Databricks jobs, configure the job's compute cluster just like an All-Purpose compute cluster.

Note: Serverless compute for Databricks jobs is currently not supported.
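
For example, when a job runs on new compute defined through the Jobs API, the same Spark properties go under the job cluster's spark_conf. A partial sketch with placeholder names, versions, and credentials, here using the S3 Gateway keys for a hypothetical repository named example-repo:

```json
{
  "name": "nightly-etl-on-lakefs",
  "tasks": [
    {
      "task_key": "etl",
      "notebook_task": { "notebook_path": "/Workspace/etl/ingest" },
      "new_cluster": {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
        "spark_conf": {
          "spark.hadoop.fs.s3a.bucket.example-repo.endpoint": "https://lakefs.example.com",
          "spark.hadoop.fs.s3a.bucket.example-repo.access.key": "AKIAlakefsEXAMPLE",
          "spark.hadoop.fs.s3a.bucket.example-repo.secret.key": "<lakefs-secret-access-key>",
          "spark.hadoop.fs.s3a.bucket.example-repo.path.style.access": "true"
        }
      }
    }
  ]
}
```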

SQL Warehouses

Classic & Pro warehouses are used to run SQL commands on data objects in the SQL editor or interactive notebooks. Serverless warehouses do the same, except that they are on-demand elastic compute.

None of the warehouse types allow installing external JARs such as the lakeFS Hadoop Filesystem. To use SQL warehouses with lakeFS, use the lakeFS S3 Gateway.
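
One possible setup, assuming the warehouse's Data Access Configuration (in the workspace's SQL admin settings) is used to supply the gateway properties; the repository name, endpoint, and keys below are placeholders:

```
spark.hadoop.fs.s3a.bucket.example-repo.endpoint https://lakefs.example.com
spark.hadoop.fs.s3a.bucket.example-repo.access.key AKIAlakefsEXAMPLE
spark.hadoop.fs.s3a.bucket.example-repo.secret.key <lakefs-secret-access-key>
spark.hadoop.fs.s3a.bucket.example-repo.path.style.access true
```

With that in place, SQL queries can read tables through s3a paths that point at a repository and branch, subject to your workspace's access controls.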

Unity Catalog

Unity Catalog is Databricks’ metastore that provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces.

lakeFS can be used with Unity Catalog to provide a versioned view of the data and the Unity Catalog metadata.

lakeFS support for Unity Catalog differs between lakeFS OSS and lakeFS Enterprise & Cloud.

lakeFS Catalog Exports

Leveraging the external tables feature within Unity Catalog, you can register a Delta Lake table exported from lakeFS and access it through the unified catalog.

Note: lakeFS Catalog exporters offer read-only table exports.

Catalog Exports relies on lakeFS Actions and offers a way to export changes from lakeFS to Unity Catalog.
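
As a rough sketch of the shape such an action could take (the action name, branch, hook id, and script path are placeholders; the actual exporter script and its required properties are covered in the Catalog Exports documentation):

```yaml
name: unity_catalog_export
on:
  post-commit:
    branches:
      - main
hooks:
  - id: export_tables_to_unity
    type: lua
    properties:
      # Placeholder: the exporter Lua script and its required properties
      # (Databricks host, token, table descriptors, and so on) are described
      # in the lakeFS Catalog Exports documentation.
      script_path: scripts/unity_exporter.lua
```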

For the full guide on how to use Catalog Exports with Unity Catalog, see the documentation.

lakeFS for Databricks Enterprise

lakeFS for Databricks provides a seamless integration between lakeFS and Unity Catalog. Its primary benefits over the integration offered by lakeFS open-source are:

  • Table write support
  • Native Unity Catalog interaction: instead of reading and writing lakeFS paths directly, use SQL to access data stored in Unity Catalog
  • Advanced serverless support: lakeFS works with any serverless compute, whether SQL warehouses or serverless notebooks

For more information, visit the lakeFS for Databricks page.

Delta Lake

lakeFS supports Delta Lake tables and provides a versioned view of both the data and the Delta Lake metadata. Read the Delta Lake docs for more information.
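
For example, a minimal sketch of writing and reading a Delta table on an isolated lakeFS branch, assuming a hypothetical repository named example-repo with a dev branch and one of the filesystem configurations described above:

```python
# Hypothetical example data.
df = spark.createDataFrame([(1, "signup"), (2, "login")], ["user_id", "event"])

# Write a Delta table to the dev branch; main is unaffected until the change
# is merged in lakeFS.
df.write.format("delta").mode("overwrite").save("lakefs://example-repo/dev/tables/events")

# Read the table back from the same branch.
events = spark.read.format("delta").load("lakefs://example-repo/dev/tables/events")
```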