
Isolated Environments

Why Do I Need Multiple Environments?

When developing over a data lake, it is useful to have replicas of your production environment. These replicas allow you to test and understand changes to your data without impacting consumers of the production data.

Running ETL and transformation jobs directly in production is a guaranteed way to have data issues flow into dashboards, ML models, and other consumers sooner or later. The most common approach to avoid making changes directly in production is to create and maintain a second data environment called development (or dev) where updates are implemented first.

The issue with this approach is that maintaining a separate dev environment is time-consuming and costly. For larger teams, it also forces multiple people to share one environment, requiring coordination.

How do I create isolated environments with lakeFS?

lakeFS makes it instantaneous to create isolated development environments. This frees you from spending time on environment maintenance and makes it possible to create as many environments as needed.

In a lakeFS repository, data is always located on a branch. You can think of each branch in lakeFS as its own environment. This is because branches are isolated, meaning changes on one branch have no effect on other branches.

Objects that are unchanged between two branches are not copied; instead, they are shared by both branches via metadata pointers that lakeFS manages. If you make a change on one branch and want it reflected on another, you can perform a merge operation to update one branch with the changes from another.

Let’s show an example of using multiple lakeFS branches for isolation.

Using Branches as Environments

The key difference when using lakeFS for isolated data environments is that you can create them immediately before testing a change. And once new data is merged into production, you can delete the branch, effectively deleting the old environment.

This is different from creating a long-living dev environment that is used as a staging area to test all updates. With lakeFS, we create a new branch for each change to production we want to make. (One benefit of this is the ability to test multiple changes at one time).

Prerequisites

This tutorial will use an existing lakeFS environment and an Apache Spark notebook.

To learn how to set up a lakeFS environment, check out the Quickstart.

Once you have a live environment, it’ll be useful to add some data into the lakeFS repo. We’ll use an Amazon review dataset from a public S3 bucket. First we’ll download the file to our local computer using the AWS CLI. Then, we’ll upload it into lakeFS using the Upload Object button in the UI.

To install the AWS CLI, follow these instructions

Download the file:

aws s3 cp s3://amazon-reviews-pds/parquet/product_category=Sports/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet $HOME/

See Objects

Next, on the Objects tab of the example repo, click Upload Object, then Choose File, and select the downloaded file in your file browser.

Upload Object

Once it is uploaded, we’ll see the file in the repository on the main branch. Currently it is in an uncommitted state. Let’s commit it!

Commit Object

To do this we can go to the Uncommitted Changes tab and click the green Commit Changes button in the top right. Add a commit message and the file is in the version history of our lakeFS repo.

As the final setup step, we’re going to create a new branch called double-branch. To do this we can use the lakeFS UI by going to the Branches tab and clicking Create Branch. Once we create it, we’ll see two branches, main and double-branch.

Create Branch

This new branch serves as an isolated environment on which we can make changes that have no effect on main. Let’s see that in action by using…

Data Manipulation with Jupyter & Spark

This use case demonstrates manipulating data with Spark from a Jupyter notebook, but you can integrate lakeFS with your favorite Spark environment. Let’s use them to manipulate the data on one branch, showing how it has no effect on the other.

Go to your Spark notebook UI; in our case, this is the Jupyter UI.

In Jupyter, create a new notebook and start a Spark context:

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
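
If your Spark environment is not already configured to talk to lakeFS, you will also need to point the Hadoop S3A client at your lakeFS installation so that s3a:// paths resolve against the repository. The following is a minimal sketch; the endpoint URL and keys are placeholders for your own lakeFS endpoint and credentials:

# Point the S3A filesystem at the lakeFS S3 gateway so that
# s3a://<repo>/<branch>/... paths are served by lakeFS.
# The endpoint and keys below are placeholders; substitute your own.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "https://lakefs.example.com")
hadoop_conf.set("fs.s3a.access.key", "<lakeFS access key id>")
hadoop_conf.set("fs.s3a.secret.key", "<lakeFS secret access key>")
hadoop_conf.set("fs.s3a.path.style.access", "true")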

Now we can use Spark to read the Parquet file we added to the main branch of our lakeFS repo:

df = spark.read.parquet('s3a://example/main/')

To see the DataFrame, run df.show(). Running df.count() tells us the DataFrame has roughly 486k rows.
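
For example, in the notebook (df is the DataFrame we just read from main):

df.show(5)   # preview the first few rows
df.count()   # returns roughly 486k rows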

Changing one branch

Let’s simulate a mistake: write the DataFrame back to the double-branch branch in append mode, creating duplicate data on that branch.

df.write.mode('append').parquet('s3a://example/double-branch/')

What happens if we re-read in the data on both branches and perform a count on the resulting DataFrames?

Branch Counts
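
To reproduce these counts in the notebook, re-read each branch path (using the same example repository as before) and count the rows:

# Read the dataset from each branch and compare row counts
main_df = spark.read.parquet('s3a://example/main/')
double_df = spark.read.parquet('s3a://example/double-branch/')

main_df.count()    # ~486k rows: main is untouched
double_df.count()  # ~972k rows: the append duplicated the data on double-branch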

As expected, there are now twice as many rows (972k) on the double-branch branch. That means we duplicated our data. Oh no!

Data duplication introduces errors into our analytics, BI, and machine learning efforts, so it is something we want to avoid.

Don’t be afraid!

On the main branch, however, there are still just the original 486k rows. This shows the utility of branch-based isolated environments with lakeFS.

You can safely continue working with the data on main, which remains unharmed thanks to lakeFS’s isolation capabilities.