
Deploy lakeFS on Azure

The instructions given here are for a self-managed deployment of lakeFS on Azure.

For a hosted lakeFS service with guaranteed SLAs, try lakeFS Cloud

This guide walks you through the options available when deploying lakeFS on Azure and how to configure them, finishing with configuring and running lakeFS itself and creating your first repository.

⏰ Expected deployment time: 25 min

1. Object Storage

lakeFS supports the following Azure Storage types:

  1. Azure Blob Storage
  2. Azure Data Lake Storage Gen2 (HNS)

Data Lake Storage Gen1 is not supported.

2. Authentication Method

lakeFS supports two ways to authenticate with Azure: identity-based authentication (recommended) and storage account credentials.

For identity-based authentication, lakeFS uses environment variables to determine which credentials to use. The following authentication methods are supported:

  1. Managed Service Identity (MSI)
  2. Service Principal RBAC
  3. Azure CLI

For deployments inside the Azure ecosystem it is recommended to use a managed identity.

More information on authentication methods and environment variables can be found here.

How to Create Service Principal for Resource Group

It is recommended to create a resource group that consists of all the resources lakeFS should have access to.

Using a resource group will allow dynamic removal/addition of services from the group, effectively providing/preventing access for lakeFS to these resources without requiring any changes in configuration in lakeFS or providing lakeFS with any additional credentials.

The minimal role required for the service principal is “Storage Blob Data Contributor”.

The following Azure CLI command creates a service principal for a resource group called “lakeFS”, with permission to access (read/write/delete) Blob Storage resources in the resource group, and with an expiry of 5 years:

az ad sp create-for-rbac \
  --role "Storage Blob Data Contributor" \
  --scopes /subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/lakeFS --years 5
    
Creating 'Storage Blob Data Contributor' role assignment under scope '/subscriptions/947382ea-681a-4541-99ab-b718960c6289/resourceGroups/lakeFS'
The output includes credentials that you must protect. Be sure that you do not include these credentials in your code or check the credentials into your source control. For more information, see https://aka.ms/azadsp-cli
{
  "appId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "displayName": "azure-cli-2023-01-30-06-18-30",
  "password": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "tenant": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}

The command output should be used to populate the following environment variables:

AZURE_CLIENT_ID      =  $appId
AZURE_TENANT_ID      =  $tenant
AZURE_CLIENT_SECRET  =  $password
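
For example, when running the lakeFS binary directly, you might export these variables before starting the server. This is a minimal sketch; the placeholder values stand in for the appId, tenant, and password fields from the command output above:

# Placeholder values; replace with the output of the az command above
export AZURE_CLIENT_ID="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"      # appId
export AZURE_TENANT_ID="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"      # tenant
export AZURE_CLIENT_SECRET="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"  # password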

Note: Service Principal credentials have an expiry date and lakeFS will lose access to resources unless credentials are renewed on time.

Note: It is possible to provide both storage account credentials and environment variables to lakeFS. In that case, lakeFS will use the storage account credentials for any access to data located in the given account, and will try to use the identity credentials for any data located outside the given account.

Alternatively, storage account credentials can be set directly in the lakeFS configuration using the following parameters, as shown in the example below:

  • blockstore.azure.storage_account
  • blockstore.azure.storage_access_key
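
For example, a minimal blockstore section using storage account credentials might look like this (placeholder values):

blockstore:
  type: azure
  azure:
    storage_account: "[YOUR_STORAGE_ACCOUNT]"
    storage_access_key: "[YOUR_ACCESS_KEY]"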

Limitations

Please note that using this authentication method limits lakeFS to the scope of the given storage account.

Specifically, the following operations will not work:

  1. Import of data from different storage accounts
  2. Copy/Read/Write of data that was imported from a different storage account
  3. Create pre-signed URL for data that was imported from a different storage account

3. K/V Store

lakeFS stores metadata in a database for its versioning engine. This is done via a Key-Value interface that can be implemented on any DB engine and lakeFS comes with several built-in driver implementations (You can read more about it here). The database used doesn’t have to be a dedicated K/V database.

CosmosDB is a managed database service provided by Azure. lakeFS supports CosmosDB For NoSQL as a database backend.

  1. Follow the official Azure documentation on how to create a CosmosDB account for NoSQL and connect to it.
  2. Once your CosmosDB account is set up, you can create a Database for lakeFS. For single-region deployments, make sure to select the Bounded staleness consistency level to preserve lakeFS’s ACID guarantees.
  3. Create a new container in the database and select partitionKey as the Partition key (case-sensitive).
  4. Pass the endpoint, database name and container name to lakeFS as described in the configuration guide (a sketch follows this list). You can either pass the CosmosDB account’s read-write key to lakeFS, or use a managed identity to authenticate to CosmosDB, as described earlier.
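
As a sketch, the resulting database section of the lakeFS configuration might look like the following; the exact parameter names are documented in the configuration guide, and all values here are placeholders:

database:
  type: "cosmosdb"
  cosmosdb:
    endpoint: "https://[YOUR_COSMOSDB_ACCOUNT].documents.azure.com:443/"
    database: "[DATABASE_NAME]"
    container: "[CONTAINER_NAME]"
    # Omit the key to authenticate with a managed identity instead:
    key: "[ACCOUNT_READ_WRITE_KEY]"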

A note on CosmosDB capacity modes: lakeFS’s use of CosmosDB is still in its early days and has not been battle-tested. Both capacity modes, Provisioned and Serverless, have been tested for some workloads and passed with flying colors. The Provisioned mode was configured with 400-4000 RU/s.

Alternatively, you can use PostgreSQL. Below we show how to create a database on Azure Database for PostgreSQL, but you can use any PostgreSQL database as long as it’s accessible by your lakeFS installation.

If you already have a database, take note of the connection string and skip to the next step.

  1. Follow the official Azure documentation on how to create a PostgreSQL instance and connect to it. Make sure that you’re using PostgreSQL version >= 11.
  2. Once your Azure Database for PostgreSQL server is set up and the server is in the Available state, take note of the endpoint and username (see the example connection string after this list).

    Azure postgres Connection String

  3. Make sure your Access control roles allow you to connect to the database instance.
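
The connection string you pass to lakeFS follows the standard PostgreSQL URI format. For an Azure Database for PostgreSQL server it usually looks something like the line below; the values are placeholders and the exact host name depends on your server:

postgres://[USERNAME]:[PASSWORD]@[SERVER_NAME].postgres.database.azure.com:5432/postgres?sslmode=require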

4. Run the lakeFS server

Now that you’ve chosen and configured object storage, a K/V store, and authentication, you’re ready to configure and run lakeFS. There are three different ways you can run lakeFS: directly on a VM, with Docker, or on Kubernetes using Helm.

Connect to your VM instance using SSH:

  1. Create a config.yaml on your VM, with the following parameters:

    ---
    database:
      type: "postgres"
      postgres:
        connection_string: "[DATABASE_CONNECTION_STRING]"
      
    auth:
      encrypt:
        # replace this with a randomly-generated string. Make sure to keep it safe!
        secret_key: "[ENCRYPTION_SECRET_KEY]"
       
    blockstore:
      type: azure
      azure:
        # With identity-based authentication (e.g. a managed identity), no further
        # configuration is needed here. If you chose storage account credentials,
        # uncomment the following lines and fill in the values:
        # storage_account: "[YOUR_STORAGE_ACCOUNT]"
        # storage_access_key: "[YOUR_ACCESS_KEY]"
    
  2. Download the binary to run on the VM.
  3. Run the lakefs binary:

    lakefs --config config.yaml run
    

Note: It’s preferable to run the binary as a service using systemd or your operating system’s facilities.
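
As a minimal sketch of such a service, assuming the binary lives in /usr/local/bin and the configuration in /etc/lakefs (adjust paths and the service user to your environment), you could save the following as /etc/systemd/system/lakefs.service:

[Unit]
Description=lakeFS server
After=network.target

[Service]
# Assumes the lakefs binary and the config.yaml from the previous steps
ExecStart=/usr/local/bin/lakefs --config /etc/lakefs/config.yaml run
Restart=on-failure
User=lakefs

[Install]
WantedBy=multi-user.target

Then enable and start it with systemctl enable --now lakefs.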

To support container-based environments, you can configure lakeFS using environment variables. Here is a docker run command to demonstrate starting lakeFS using Docker:

docker run \
  --name lakefs \
  -p 8000:8000 \
  -e LAKEFS_DATABASE_TYPE="postgres" \
  -e LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
  -e LAKEFS_BLOCKSTORE_TYPE="azure" \
  -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCOUNT="[YOUR_STORAGE_ACCOUNT]" \
  -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCESS_KEY="[YOUR_ACCESS_KEY]" \
  treeverse/lakefs:latest run

See the reference for a complete list of environment variables.

You can install lakeFS on Kubernetes using a Helm chart.

To install lakeFS with Helm:

  1. Copy the Helm values file relevant for Azure Blob:

    secrets:
        # replace this with the connection string of the database you created in a previous step:
        databaseConnectionString: [DATABASE_CONNECTION_STRING]
        # replace this with a randomly-generated string
        authEncryptSecretKey: [ENCRYPTION_SECRET_KEY]
    lakefsConfig: |
        blockstore:
          type: azure
          azure:
            # If you chose to authenticate via access key, uncomment the following
            # lines and insert the values from the previous step:
            # storage_account: [your storage account]
            # storage_access_key: [your access key]
    
  2. Fill in the missing values and save the file as conf-values.yaml. For more configuration options, see our Helm chart README.

    The lakefsConfig parameter is the lakeFS configuration documented here but without sensitive information. Sensitive information like databaseConnectionString is given through separate parameters, and the chart will inject it into Kubernetes secrets.

  3. In the directory where you created conf-values.yaml, run the following commands:

    # Add the lakeFS repository
    helm repo add lakefs https://charts.lakefs.io
    # Deploy lakeFS
    helm install my-lakefs lakefs/lakefs -f conf-values.yaml
    

    my-lakefs is the Helm Release name.

Load balancing

To configure a load balancer to direct requests to the lakeFS servers you can use the LoadBalancer Service type or a Kubernetes Ingress. By default, lakeFS operates on port 8000 and exposes a /_health endpoint that you can use for health checks.

💡 The NGINX Ingress Controller by default limits the client body size to 1 MiB. Some clients use bigger chunks to upload objects - for example, multipart upload to lakeFS using the S3-compatible Gateway or a simple PUT request using the OpenAPI Server. Check out Nginx documentation for increasing the limit, or an example of Nginx configuration with MinIO.
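
For example, an Ingress for lakeFS behind the NGINX Ingress Controller might look like the sketch below. The host name, the service name (the my-lakefs Helm release from the previous step), and the service port are assumptions to adjust for your cluster; the annotation lifts the 1 MiB body-size limit mentioned above:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: lakefs
  annotations:
    # Remove the default 1 MiB client body size limit so large uploads are not rejected
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
spec:
  ingressClassName: nginx
  rules:
    - host: lakefs.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-lakefs
                # Adjust to the port exposed by your lakeFS Service
                port:
                  number: 80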

Create the admin user

When you first open the lakeFS UI, you will be asked to create an initial admin user.

  1. Open http://<lakefs-host>/ in your browser. If you haven’t set up a load balancer, this will likely be http://<instance ip address>:8000/
  2. On first use, you’ll be redirected to the setup page:

    Create user

  3. Follow the steps to create an initial administrator user. Save the credentials you’ve received somewhere safe; you won’t be able to see them again!

    Setup Done

  4. Follow the link and go to the login screen. Use the credentials from the previous step to log in.

Create your first repository

  1. Use the credentials from the previous step to log in
  2. Click Create Repository and choose Blank Repository.

    Create Repo

  3. Under Storage Namespace, enter a path to your desired location on the object store. This is where data written to this repository will be stored. (The lakectl example after this list shows the URI format for Azure.)
  4. Click Create Repository
  5. You should now have a configured repository, ready to use!

    Repo Created
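
Alternatively, you can create the repository from the command line with lakectl. This is a hypothetical example; the repository name, storage account, container, and prefix are placeholders, and the storage namespace for Azure takes the form https://<storage-account>.blob.core.windows.net/<container>/<path>:

# Replace the placeholders with your own repository name and storage location
lakectl repo create lakefs://example-repo \
  https://[STORAGE_ACCOUNT].blob.core.windows.net/[CONTAINER]/example-prefix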

Congratulations! Your environment is now ready 🤩