
Installing lakeFS

For production deployments, install the lakeFS binary on your host of choice.

Prerequisites

A production-suitable lakeFS installation requires three DNS records pointing at your lakeFS server. Assuming you already own the domain example.com, a good convention for these records is:

  • lakefs.example.com
  • s3.lakefs.example.com - this is the S3 Gateway Domain
  • *.s3.lakefs.example.com

The second record, the S3 Gateway Domain, is used in lakeFS configuration to differentiate between the S3 Gateway API and the OpenAPI Server. For more info, see Why do I need these three DNS records?
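
All three records should point at the same place: the load balancer (or server) in front of lakeFS. As a minimal sketch in BIND zone-file notation, assuming a hypothetical load balancer reachable at lakefs-lb.example.com:

; lakefs-lb.example.com is an illustrative target, not a real lakeFS host
lakefs.example.com.      IN CNAME lakefs-lb.example.com.
s3.lakefs.example.com.   IN CNAME lakefs-lb.example.com.
*.s3.lakefs.example.com. IN CNAME lakefs-lb.example.com.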

Find your preferred installation method:

  1. Kubernetes with Helm
  2. Docker
  3. AWS ECS / Google Cloud Run / Azure Container Instances
  4. AWS EC2 / Google Compute Engine / Azure Virtual Machine

Kubernetes with Helm

lakeFS can be easily installed on Kubernetes using a Helm chart. To install lakeFS with Helm:

  1. Copy the Helm values file relevant to your cloud provider:
    AWS:

    secrets:
     # replace DATABASE_CONNECTION_STRING with the connection string of the database you created in a previous step.
     # e.g. postgres://postgres:myPassword@my-lakefs-db.rds.amazonaws.com:5432/lakefs
     databaseConnectionString: [DATABASE_CONNECTION_STRING]
     # replace this with a randomly-generated string
     authEncryptSecretKey: [ENCRYPTION_SECRET_KEY]
    lakefsConfig: |
     blockstore:
       type: s3
       s3:
         region: us-east-1
     gateways:
       s3:
         # replace this with the host you will use for the lakeFS S3-compatible endpoint:
         domain_name: [S3_GATEWAY_DOMAIN]
    
    Google Cloud:

    secrets:
     # replace DATABASE_CONNECTION_STRING with the connection string of the database you created in a previous step.
     # e.g. postgres://postgres:myPassword@localhost:5432/postgres
     databaseConnectionString: [DATABASE_CONNECTION_STRING]
     # replace this with a randomly-generated string
     authEncryptSecretKey: [ENCRYPTION_SECRET_KEY]
    lakefsConfig: |
     blockstore:
       type: gs
     # Uncomment the following lines to give lakeFS access to your buckets using a service account:
     # gs:
     #   credentials_json: [YOUR SERVICE ACCOUNT JSON STRING]
     gateways:
       s3:
         # replace this with the host you will use for the lakeFS S3-compatible endpoint:
         domain_name: [S3_GATEWAY_DOMAIN]
    

    Notes for running lakeFS on GKE

    • To connect to your database, you need to use one of the ways of connecting GKE to Cloud SQL.
    • To give lakeFS access to your bucket, you can start the cluster in storage-rw mode. Alternatively, you can use a service account JSON string by uncommenting the gs.credentials_json property in the Google Cloud YAML above.
    Azure:

    secrets:
     # replace this with the connection string of the database you created in a previous step:
     databaseConnectionString: [DATABASE_CONNECTION_STRING]
     # replace this with a randomly-generated string
     authEncryptSecretKey: [ENCRYPTION_SECRET_KEY]
    lakefsConfig: |
     blockstore:
       type: azure
       azure:
         auth_method: msi # msi for Active Directory, access-key for access key
         # To authenticate with an access key instead, uncomment the following lines
         # and insert the values from the previous step:
         # storage_account: [YOUR_STORAGE_ACCOUNT]
         # storage_access_key: [YOUR_ACCESS_KEY]
     gateways:
       s3:
         # replace this with the host you will use for the lakeFS S3-compatible endpoint:
          domain_name: [S3_GATEWAY_DOMAIN]
    
  2. Fill in the missing values and save the file as conf-values.yaml. For more configuration options, see our Helm chart README.

    The lakefsConfig parameter is the lakeFS configuration documented here, but without sensitive information. Sensitive information like databaseConnectionString is given through separate parameters, and the chart will inject them into Kubernetes secrets.

  3. In the directory where you created conf-values.yaml, run the following commands:

     # Add the lakeFS repository
     helm repo add lakefs https://charts.lakefs.io
     # Deploy lakeFS
     helm install example-lakefs lakefs/lakefs -f conf-values.yaml
    

    example-lakefs is the Helm Release name.
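
Any long random string works for authEncryptSecretKey; here is a common way to generate one, and a quick way to check the release afterwards (a sketch, assuming the example-lakefs release name used above):

# Generate a random value for authEncryptSecretKey:
openssl rand -hex 32
# After installing, verify the release and its pods:
helm status example-lakefs
kubectl get pods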

You should give your Kubernetes nodes access to all buckets/containers you intend to use lakeFS with. If you can't provide such access, lakeFS can be configured to authenticate using an AWS key pair, an Azure access key, or a Google Cloud credentials file (as part of the lakefsConfig YAML above).
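
On AWS, for example, node access typically means attaching an instance profile whose policy allows object and listing operations on the relevant buckets. A hedged sketch, with my-lakefs-bucket as an illustrative name:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-lakefs-bucket",
        "arn:aws:s3:::my-lakefs-bucket/*"
      ]
    }
  ]
}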

Once your installation is running, move on to Load Balancing and DNS.

Docker

To deploy using Docker, create a YAML configuration file. Below is a minimal example for each cloud provider; see the reference for the full list of configuration options.

AWS:

database:
  connection_string: "[DATABASE_CONNECTION_STRING]"
auth:
  encrypt:
    secret_key: "[ENCRYPTION_SECRET_KEY]"
blockstore:
  type: s3
gateways:
  s3:
    domain_name: "[S3_GATEWAY_DOMAIN]"
Google Cloud:

database:
  connection_string: "[DATABASE_CONNECTION_STRING]"
auth:
  encrypt:
    secret_key: "[ENCRYPTION_SECRET_KEY]"
blockstore:
  type: gs
# Uncomment the following lines to give lakeFS access to your buckets using a service account:
# gs:
#   credentials_json: [YOUR SERVICE ACCOUNT JSON STRING]
gateways:
  s3:
    domain_name: "[S3_GATEWAY_DOMAIN]"
Azure:

database:
  connection_string: "postgres://user:pass@<AZURE_POSTGRES_SERVER_NAME>..."
auth:
  encrypt:
    secret_key: "<RANDOM_GENERATED_STRING>"
blockstore:
  type: azure
  azure:
    auth_method: msi # msi for Active Directory, access-key for access key
    # To authenticate with an access key instead, uncomment the following lines
    # and insert the values from the previous step:
    # storage_account: <your storage account>
    # storage_access_key: <your access key>
gateways:
  s3:
    domain_name: "[S3_GATEWAY_DOMAIN]"

Save the configuration file locally as lakefs-config.yaml and run the following command:

docker run \
  --name lakefs \
  -p 8000:8000 \
  -v $(pwd)/lakefs-config.yaml:/home/lakefs/.lakefs.yaml \
  treeverse/lakefs:latest run
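
Once the container is up, a quick sanity check from the host (a sketch assuming the port mapping above; /_health is the lakeFS liveness endpoint):

# Confirm the container is running:
docker ps --filter name=lakefs
# The server should answer on the mapped port:
curl -i http://localhost:8000/_health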

Once your installation is running, move on to Load Balancing and DNS.

AWS ECS / Google Cloud Run / Azure Container Instances

Some environments make it harder to use a configuration file and are best configured using environment variables. Every lakeFS configuration option can be given through an environment variable; see the reference for the full list.

These variables can be used to run lakeFS on container orchestration services like AWS ECS, Google Cloud Run, or Azure Container Instances. Here are docker run commands for each cloud provider demonstrating their use:

AWS:

docker run \
  --name lakefs \
  -p 8000:8000 \
  -e LAKEFS_DATABASE_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
  -e LAKEFS_BLOCKSTORE_TYPE="s3" \
  -e LAKEFS_GATEWAYS_S3_DOMAIN_NAME="[S3_GATEWAY_DOMAIN]" \
  treeverse/lakefs:latest run
Google Cloud:

docker run \
  --name lakefs \
  -p 8000:8000 \
  -e LAKEFS_DATABASE_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
  -e LAKEFS_BLOCKSTORE_TYPE="gs" \
  -e LAKEFS_GATEWAYS_S3_DOMAIN_NAME="[S3_GATEWAY_DOMAIN]" \
  treeverse/lakefs:latest run
Azure:

docker run \
  --name lakefs \
  -p 8000:8000 \
  -e LAKEFS_DATABASE_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
  -e LAKEFS_BLOCKSTORE_TYPE="azure" \
  -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCOUNT="[YOUR_STORAGE_ACCOUNT]" \
  -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCESS_KEY="[YOUR_ACCESS_KEY]" \
  -e LAKEFS_GATEWAYS_S3_DOMAIN_NAME="[S3_GATEWAY_DOMAIN]" \
  treeverse/lakefs:latest run
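
On a managed service, the same variables are passed through the platform's own flags instead of docker run. As an illustrative sketch for Google Cloud Run (the service name is hypothetical, and the image must live in a registry Cloud Run can pull from):

gcloud run deploy lakefs \
  --image treeverse/lakefs:latest \
  --port 8000 \
  --args run \
  --set-env-vars "LAKEFS_DATABASE_CONNECTION_STRING=[DATABASE_CONNECTION_STRING],LAKEFS_AUTH_ENCRYPT_SECRET_KEY=[ENCRYPTION_SECRET_KEY],LAKEFS_BLOCKSTORE_TYPE=gs,LAKEFS_GATEWAYS_S3_DOMAIN_NAME=[S3_GATEWAY_DOMAIN]"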

Once your installation is running, move on to Load Balancing and DNS.

AWS EC2 / Google Compute Engine / Azure Virtual Machine

Run lakeFS directly on a cloud instance:

  1. Download the binary for your operating system.
  2. lakefs is a single binary; you can run it directly, but it's preferable to run it as a service using systemd or your operating system's facilities (a minimal example unit file follows this list).

    lakefs --config <PATH_TO_CONFIG_FILE> run
    
  3. On Azure, to support Azure AD authentication, go to the virtual machine's Identity tab, switch the Status toggle to On, and then add the Storage Blob Data Contributor role on the container you created.
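
A minimal systemd unit sketch for step 2 (the binary path, config path, and lakefs service user are illustrative assumptions):

# /etc/systemd/system/lakefs.service
[Unit]
Description=lakeFS server
After=network.target

[Service]
User=lakefs
ExecStart=/usr/local/bin/lakefs --config /etc/lakefs/config.yaml run
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable and start it with sudo systemctl enable --now lakefs.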

Once your installation is running, move on to Load Balancing and DNS.

Why do I need these three DNS records?

Multiple DNS records are needed to access the two different lakeFS APIs (covered in more detail in the Architecture section):

  1. The lakeFS OpenAPI: used by the lakectl CLI tool. Exposes Git-like operations (branching, diffing, merging, etc.).
  2. An S3-compatible API: lets you read and write your data with any tool that can communicate with S3, such as the AWS CLI, Boto, Presto, and Spark.

In fact, lakeFS exposes only one API endpoint. For every request, lakeFS checks the Host header: if the host is under the S3 gateway domain, the request is routed to the S3-compatible API.

The third DNS record (*.s3.lakefs.example.com) allows virtual-host-style access, in which AWS clients specify the bucket name as a subdomain of the Host header.
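
You can observe this routing by sending two requests to the same listener with different Host headers; a sketch assuming lakeFS is listening locally on port 8000 (the bucket name my-repo is illustrative):

# Host under the main domain: routed to the OpenAPI server
curl -i -H "Host: lakefs.example.com" http://localhost:8000/api/v1/healthcheck
# Host under the S3 gateway domain: routed to the S3-compatible API,
# equivalent to a virtual-host-style request for bucket "my-repo"
curl -i -H "Host: my-repo.s3.lakefs.example.com" http://localhost:8000/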