# Using lakeFS with Apache Airflow
Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. Running Airflow with lakeFS takes only a few steps, described below.
## Create a lakeFS connection on Airflow

To access the lakeFS server and authenticate with it, create a new Airflow Connection of type HTTP and add it to your DAG. You can do that using the Airflow UI or the CLI. Here's an example Airflow command that does just that:

```shell
airflow connections add conn_lakefs --conn-type=HTTP --conn-host=http://<LAKEFS_ENDPOINT> \
    --conn-extra='{"access_key_id":"<LAKEFS_ACCESS_KEY_ID>","secret_access_key":"<LAKEFS_SECRET_ACCESS_KEY>"}'
```
## Install the lakeFS Airflow package

Use pip to install the package:
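The command below assumes the provider is published on PyPI as `airflow-provider-lakefs`, matching the repository name referenced later on this page:

```shell
pip install airflow-provider-lakefs
```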
## Use the package

### Operators
The package exposes several operators for interacting with a lakeFS server:

- `CreateBranchOperator` creates a new lakeFS branch from a source branch (`main` by default).
- `CommitOperator` commits uncommitted changes to a branch.
- `MergeOperator` merges two lakeFS branches.
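For orientation, here is a minimal DAG sketch wiring these three operators together. The import path and the parameter names (`lakefs_conn_id`, `repo`, `branch`, `source_branch`, `msg`, `source_ref`, `destination_branch`) are assumptions made for illustration; consult the provider's own documentation for the exact signatures:

```python
from datetime import datetime

from airflow import DAG

# NOTE: hypothetical import path -- adjust to match the installed provider package.
from lakefs_provider.operators import CreateBranchOperator, CommitOperator, MergeOperator

with DAG(
    dag_id="lakefs_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Branch out from main so the pipeline runs in isolation.
    create_branch = CreateBranchOperator(
        task_id="create_branch",
        lakefs_conn_id="conn_lakefs",   # the connection created earlier
        repo="example-repo",            # placeholder repository name
        branch="example-branch",
        source_branch="main",
    )

    # Commit whatever the pipeline wrote to the branch.
    commit = CommitOperator(
        task_id="commit",
        lakefs_conn_id="conn_lakefs",
        repo="example-repo",
        branch="example-branch",
        msg="Commit pipeline outputs",
    )

    # Merge the branch back into main once everything succeeded.
    merge = MergeOperator(
        task_id="merge",
        lakefs_conn_id="conn_lakefs",
        repo="example-repo",
        source_ref="example-branch",
        destination_branch="main",
    )

    create_branch >> commit >> merge
```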
### Sensors

Sensors are also available for synchronizing a running DAG with external operations:

- `CommitSensor` waits until a commit has been applied to a branch.
- `FileSensor` waits until a given file is present on a branch.
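Continuing the sketch above, a sensor can gate downstream tasks until a commit lands on the branch. Again, the import path and parameters are illustrative assumptions:

```python
# NOTE: hypothetical import path -- adjust to match the installed provider package.
from lakefs_provider.sensors import CommitSensor

wait_for_commit = CommitSensor(
    task_id="wait_for_commit",
    lakefs_conn_id="conn_lakefs",
    repo="example-repo",        # placeholder repository name
    branch="example-branch",
    mode="reschedule",          # release the worker slot between pokes
)
```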
## Example

This example DAG in the airflow-provider-lakeFS repository shows how to use all of these operators and sensors.
## Performing other operations

Sometimes an operation you need is not yet supported by airflow-provider-lakeFS. In that case, you can access lakeFS directly by using:

- `SimpleHttpOperator` to send API requests to lakeFS.
- `BashOperator` to run `lakectl` commands.
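For instance, this sketch creates a branch through the lakeFS REST API (`POST /api/v1/repositories/{repository}/branches`). It assumes the `conn_lakefs` connection stores the lakeFS access key ID and secret access key in its login and password fields, since lakeFS uses HTTP basic authentication:

```python
import json

from airflow.providers.http.operators.http import SimpleHttpOperator

create_branch_api = SimpleHttpOperator(
    task_id="create_branch_api",
    http_conn_id="conn_lakefs",  # assumes login/password hold the lakeFS key pair
    method="POST",
    endpoint="/api/v1/repositories/example-repo/branches",  # placeholder repository
    data=json.dumps({"name": "example-branch", "source": "main"}),
    headers={"Content-Type": "application/json"},
    # lakeFS answers 201 Created when the branch is created successfully.
    response_check=lambda response: response.status_code == 201,
)
```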
Similarly, deleting a branch using `BashOperator`:
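A sketch assuming `lakectl` is installed and configured on the Airflow worker; the repository and branch names are placeholders:

```python
from airflow.operators.bash import BashOperator

delete_branch = BashOperator(
    task_id="delete_branch",
    # -y skips lakectl's interactive confirmation prompt
    bash_command="lakectl branch delete lakefs://example-repo/example-branch -y",
)
```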