Using lakeFS with the Unity Catalog
Overview
Databricks Unity Catalog serves as a centralized data governance platform for your data lakes. Through Unity Catalog, you can search for and locate data assets across workspaces via a unified catalog. Leveraging Unity Catalog's external tables feature, you can register a Delta Lake table exported from lakeFS and access it through the unified catalog. The following step-by-step guide walks you through configuring a Lua hook that exports Delta Lake tables from lakeFS and then registers them in Unity Catalog.
Currently, the Unity Catalog export feature supports only AWS S3 and Azure ADLS Gen2 as the underlying storage. Support for additional cloud providers is planned.
Reference Guide: lakeFS + Unity Catalog Integration: Step-by-Step Tutorial
Prerequisites
Before starting, ensure you have the following:
- Access to Unity Catalog
- An active lakeFS installation with S3 as the backing storage, and a repository in this installation.
- A Databricks SQL warehouse.
- AWS Credentials with S3 access.
- lakeFS credentials with access to your Delta Tables.
Supported from lakeFS v1.4.0
Databricks authentication
Given that the hook will ultimately register a table in Unity Catalog, authentication with Databricks is imperative. Make sure that:
- You have a Databricks Service Principal.
- The Service principal has token usage permissions, and an associated token configured.
- The service principal has the "Service principal: Manager" privilege over itself (Workspace: Admin console -> Service principals -> <service principal> -> Permissions -> Grant access (<service principal>: "Service principal: Manager")), with Workspace access and Databricks SQL access checked (Admin console -> Service principals -> <service principal> -> Configurations).
- Your SQL warehouse allows the service principal to use it (SQL Warehouses -> <SQL warehouse> -> Permissions -> <service principal>: "Can use").
- The catalog grants the USE CATALOG, USE SCHEMA, and CREATE SCHEMA permissions to the service principal (Catalog -> <catalog name> -> Permissions -> Grant -> <service principal>: USE CATALOG, USE SCHEMA, CREATE SCHEMA).
- You have an External Location configured, and the service principal has the CREATE EXTERNAL TABLE permission over it (Catalog -> External Data -> External Locations -> Create location).
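Before wiring the token into the hook, you may want to confirm that it authenticates correctly. The snippet below is a minimal sketch using the Databricks SDK for Python; it assumes the databricks-sdk package is installed, and the host and token placeholders are the same values you will later pass to the hook as databricks_host and databricks_token.
# Minimal authentication sanity check (assumes the databricks-sdk package is installed).
from databricks.sdk import WorkspaceClient

# Use the same host and token you will later pass to the hook.
w = WorkspaceClient(host="<DATABRICKS_HOST_URL>", token="<DATABRICKS_SERVICE_PRINCIPAL_TOKEN>")

# If authentication succeeds, this prints the service principal's identity.
print(w.current_user.me().user_name)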
Guide
Table descriptor definition
To guide the Unity Catalog exporter in configuring the table in the catalog, define its properties in the Delta Lake table descriptor. The table descriptor should include (at minimum) the following fields:
- name: The table name.
- type: Should be delta.
- catalog: The name of the catalog in which the table will be created.
- path: The path in lakeFS (starting from the root of the branch) in which the Delta Lake table's data is found.
Let’s define the table descriptor and upload it to lakeFS:
Save the following as famous-people-td.yaml:
---
name: famous_people
type: delta
catalog: my-catalog-name
path: tables/famous-people
It's recommended to create a Unity catalog with the same name as your repository.
Upload the table descriptor to _lakefs_tables/famous-people-td.yaml and commit:
lakectl fs upload lakefs://repo/main/_lakefs_tables/famous-people-td.yaml -s ./famous-people-td.yaml && \
lakectl commit lakefs://repo/main -m "add famous people table descriptor"
Write some data
Insert data into the table path, using your preferred method (e.g. Spark), and commit upon completion.
We shall use Spark and lakeFS’s S3 gateway to write some data as a Delta table:
pyspark --packages "io.delta:delta-spark_2.12:3.0.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262" \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.hadoop.fs.s3a.aws.credentials.provider='org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider' \
--conf spark.hadoop.fs.s3a.endpoint='<LAKEFS_SERVER_URL>' \
--conf spark.hadoop.fs.s3a.access.key='<LAKEFS_ACCESS_KEY>' \
--conf spark.hadoop.fs.s3a.secret.key='<LAKEFS_SECRET_ACCESS_KEY>' \
--conf spark.hadoop.fs.s3a.path.style.access=true
data = [
('James','Bond','England','intelligence'),
('Robbie','Williams','England','music'),
('Hulk','Hogan','USA','entertainment'),
('Mister','T','USA','entertainment'),
('Rafael','Nadal','Spain','professional athlete'),
('Paul','Haver','Belgium','music'),
]
columns = ["firstname","lastname","country","category"]
df = spark.createDataFrame(data=data, schema = columns)
df.write.format("delta").mode("overwrite").partitionBy("category", "country").save("s3a://repo/main/tables/famous-people")
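Optionally, you can read the table back in the same PySpark session to verify the write before committing:
# Verify the write by reading the Delta table back through the lakeFS S3 gateway
df_check = spark.read.format("delta").load("s3a://repo/main/tables/famous-people")
df_check.show()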
The Unity Catalog exporter script
For code references, check the delta_exporter and unity_exporter docs.
Create unity_exporter.lua:
local aws = require("aws")
local formats = require("formats")
local databricks = require("databricks")
local delta_export = require("lakefs/catalogexport/delta_exporter")
local unity_export = require("lakefs/catalogexport/unity_exporter")
local sc = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)
-- Export the Delta Lake tables:
local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key, args.aws.region)
local delta_table_details = delta_export.export_delta_log(action, args.table_defs, sc.put_object, delta_client, "_lakefs_tables")
-- Register the exported table in Unity Catalog:
local databricks_client = databricks.client(args.databricks_host, args.databricks_token)
local registration_statuses = unity_export.register_tables(action, "_lakefs_tables", delta_table_details, databricks_client, args.warehouse_id)
for t, status in pairs(registration_statuses) do
    print("Unity catalog registration for table \"" .. t .. "\" completed with commit schema status : " .. status .. "\n")
end
Upload the Lua script to the main branch under scripts/unity_exporter.lua and commit:
lakectl fs upload lakefs://repo/main/scripts/unity_exporter.lua -s ./unity_exporter.lua && \
lakectl commit lakefs://repo/main -m "upload unity exporter script"
Action configuration
Define an action configuration that will run the above script after a commit is completed (post-commit) over the main branch.
Create unity_exports_action.yaml:
---
name: unity_exports
on:
  post-commit:
    branches: ["main"]
hooks:
  - id: unity_export
    type: lua
    properties:
      script_path: scripts/unity_exporter.lua
      args:
        aws:
          access_key_id: <AWS_ACCESS_KEY_ID>
          secret_access_key: <AWS_SECRET_ACCESS_KEY>
          region: <AWS_REGION>
        lakefs: # provide credentials of a user that has access to the script and Delta Table
          access_key_id: <LAKEFS_ACCESS_KEY_ID>
          secret_access_key: <LAKEFS_SECRET_ACCESS_KEY>
        table_defs: # an array of table descriptors to be registered in Unity Catalog
          - famous-people-td
        databricks_host: <DATABRICKS_HOST_URL>
        databricks_token: <DATABRICKS_SERVICE_PRINCIPAL_TOKEN>
        warehouse_id: <WAREHOUSE_ID>
Upload the action configuration to _lakefs_actions/unity_exports_action.yaml and commit:
Once the commit finishes, the action will start running, since we've configured it to run on post-commit events on the main branch.
lakectl fs upload lakefs://repo/main/_lakefs_actions/unity_exports_action.yaml -s ./unity_exports_action.yaml && \
lakectl commit lakefs://repo/main -m "upload action and run it"
The action has run and exported the famous_people Delta Lake table to the repository's storage namespace, and has registered the table as an external table in Unity Catalog under the catalog my-catalog-name, the schema main (named after the branch), and the table name famous_people: my-catalog-name.main.famous_people.
Databricks Integration
After registering the table in Unity Catalog, you can use your preferred method to query the data from the exported table under my-catalog-name.main.famous_people, view it in the Databricks Catalog Explorer, or retrieve it using the Databricks CLI with the following command:
databricks tables get my-catalog-name.main.famous_people
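You can also query the table directly against your SQL warehouse from Python. The following is a minimal sketch, assuming the databricks-sql-connector package is installed; the warehouse ID and token placeholders are the same values used in the action configuration, and the server hostname is your workspace host without the https:// prefix.
from databricks import sql

# Connect to the SQL warehouse; http_path follows the /sql/1.0/warehouses/<id> convention.
with sql.connect(
    server_hostname="<DATABRICKS_HOST>",  # workspace hostname, without the https:// prefix
    http_path="/sql/1.0/warehouses/<WAREHOUSE_ID>",
    access_token="<DATABRICKS_SERVICE_PRINCIPAL_TOKEN>",
) as connection:
    with connection.cursor() as cursor:
        # Backticks are required because the catalog name contains hyphens.
        cursor.execute("SELECT * FROM `my-catalog-name`.main.famous_people")
        for row in cursor.fetchall():
            print(row)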