Create a Branch
lakeFS uses branches in a similar way to Git. It’s a great way to isolate changes until, or if, we are ready to re-integrate them. lakeFS uses a zero-copy branching technique, which means that creating branches of your data is very efficient.
Having seen the lakes data in the previous step, we’re now going to create a new dataset to hold data only for lakes in Denmark. Why? Well, because :)
The first thing we’ll do is create a branch to do this development against. We’ll use the lakectl tool to create the branch, but first we need to configure it with our credentials. In a new terminal window run the following:
docker exec -it lakefs lakectl config
Follow the prompts to enter the credentials that you got in the first step. Leave the Server endpoint URL as http://127.0.0.1:8000.
Now that lakectl is configured, we can use it to create the branch. Run the following:
docker exec lakefs lakectl branch create lakefs://quickstart/denmark-lakes --source lakefs://quickstart/main
You should get a confirmation message like this:
Source ref: lakefs://quickstart/main
created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816
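To double-check that the branch exists, one option is to list the repository's branches with lakectl. This is a minimal sketch that assumes the container is still named lakefs, as in the commands above; the REPO_URI variable is only introduced here for illustration.

```shell
# Assumption: the quickstart container is named "lakefs" and lakectl is already configured.
# REPO_URI is an illustrative variable, not part of the quickstart itself.
REPO_URI="lakefs://quickstart"

# List all branches in the repository; denmark-lakes should appear alongside main.
docker exec lakefs lakectl branch list "$REPO_URI"
```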
Transforming the Data
Now we’ll make a change to the data. lakeFS has several native clients, as well as an S3-compatible endpoint. This means that anything that can use S3 will work with lakeFS. Pretty neat.
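As an illustration of the S3-compatible endpoint, the sketch below lists the objects on the new branch with the AWS CLI. It assumes the AWS CLI is installed on your machine, that AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set to the lakeFS credentials from the first step, and that lakeFS is reachable at the endpoint used above. The repository plays the role of the bucket, and the branch is the first component of the object path. The variable names are illustrative only.

```shell
# Assumptions: AWS CLI installed; AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY hold
# the lakeFS credentials; lakeFS is listening on the quickstart endpoint.
LAKEFS_ENDPOINT="http://127.0.0.1:8000"
REPO="quickstart"          # the repository acts as the S3 bucket
BRANCH="denmark-lakes"     # the branch is the first path component
S3_PATH="s3://${REPO}/${BRANCH}/"

# List the objects on the branch through the S3-compatible endpoint.
aws s3 ls "$S3_PATH" --endpoint-url "$LAKEFS_ENDPOINT"
```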
We’re going to use DuckDB which is embedded within the web interface of lakeFS.
From the lakeFS Objects page select the lakes.parquet file to open the DuckDB editor.
To start with, we’ll load the lakes data into a DuckDB table so that we can manipulate it. Replace the previous text in the DuckDB editor with this:
CREATE OR REPLACE TABLE lakes AS
SELECT * FROM READ_PARQUET('lakefs://quickstart/denmark-lakes/lakes.parquet');
You’ll see a row count of 100,000 to confirm that the DuckDB table has been created.
Just to check that it’s the same data that we saw before, we’ll run the same query. Note that we are querying a DuckDB table (lakes), rather than using a function to query a parquet file directly.
SELECT country, COUNT(*)
FROM lakes
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
Making a Change to the Data
Now we can change our table, which was loaded from the original lakes.parquet, to remove all rows not for Denmark:
DELETE FROM lakes WHERE country != 'Denmark';
We can verify that it’s worked by reissuing the same query as before:
SELECT country, COUNT(*)
FROM lakes
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
Write the Data back to lakeFS
The changes so far have only been to DuckDB’s copy of the data. Let’s now push it back to lakeFS. Note that we’re writing it to the denmark-lakes branch, not main:
COPY lakes TO 'lakefs://quickstart/denmark-lakes/lakes.parquet';
Verify that the Data’s Changed on the Branch
Let’s just confirm for ourselves that the parquet file itself has the new data. We’ll drop the lakes table just to be sure, and then query the parquet file directly:
DROP TABLE lakes;
SELECT country, COUNT(*)
FROM READ_PARQUET('lakefs://quickstart/denmark-lakes/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
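Another way to see that the branch has changed is to ask lakectl for the branch's uncommitted changes. This is a sketch assuming the same lakefs container and lakectl configuration as before; BRANCH_URI is an illustrative variable. The rewritten lakes.parquet would be expected to show up in the listing as a changed object.

```shell
# Assumption: same "lakefs" container and lakectl configuration as earlier.
# BRANCH_URI is an illustrative variable, not part of the quickstart itself.
BRANCH_URI="lakefs://quickstart/denmark-lakes"

# With a single ref, lakectl diff reports the uncommitted changes on that branch.
docker exec lakefs lakectl diff "$BRANCH_URI"
```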
What about the data in main?
So we’ve changed the data in our denmark-lakes branch, deleting swathes of the dataset. What’s this done to our original data in the main branch? Absolutely nothing! See for yourself by running the same query as above, but against the main branch:
SELECT country, COUNT(*)
FROM READ_PARQUET('lakefs://quickstart/main/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
In the next step we’ll see how to commit our changes and merge our branch back into main.