Deploy lakeFS on AWS

Expected deployment time: 25min

Table of contents

  1. Prerequisites
  2. Creating the Database on AWS RDS
  3. Installation Options
    1. On EC2
    2. On ECS
    3. On EKS
  4. Load balancing
  5. DNS on AWS Route53
  6. Next Steps

Prerequisites

A production-suitable lakeFS installation requires three DNS records pointing at your lakeFS server. Assuming you already own the domain example.com, a good convention for these is:

  • lakefs.example.com
  • s3.lakefs.example.com - this is the S3 Gateway Domain
  • *.s3.lakefs.example.com

The second record, the S3 Gateway Domain, needs to be specified in the lakeFS configuration (see the S3_GATEWAY_DOMAIN placeholder below). This allows lakeFS to route requests to the S3-compatible API. For more info, see Why do I need the three DNS records? below.

Creating the Database on AWS RDS

lakeFS requires a PostgreSQL database to synchronize actions on your repositories. We will show you how to create a database on AWS RDS, but you can use any PostgreSQL database as long as it’s accessible by your lakeFS installation.

If you already have a database, take note of the connection string and skip to the next step.

  1. Follow the official AWS documentation on how to create a PostgreSQL instance and connect to it.
    You may use the default PostgreSQL engine, or Aurora PostgreSQL. Make sure you’re using PostgreSQL version >= 11.
  2. Once your RDS is set up and the server is in Available state, take note of the endpoint and port.

    (Screenshot: RDS connection string)

  3. Make sure your security group rules allow you to connect to the database instance.
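Once you have the endpoint and port from step 2, the value to use for [DATABASE_CONNECTION_STRING] in the next steps is a standard PostgreSQL connection URI. For example (a sketch only; the host, credentials, and database name below are placeholders):

postgres://lakefsuser:mypassword@my-lakefs-db.xxxxxxxx.us-east-1.rds.amazonaws.com:5432/postgres

# Optionally, verify connectivity from the machine that will run lakeFS (requires the psql client):
psql "postgres://lakefsuser:mypassword@my-lakefs-db.xxxxxxxx.us-east-1.rds.amazonaws.com:5432/postgres" -c "SELECT 1;"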

Installation Options

On EC2

  1. Save the following configuration file as config.yaml:

    ---
    database:
      connection_string: "[DATABASE_CONNECTION_STRING]"
    auth:
      encrypt:
        # replace this with a randomly-generated string:
        secret_key: "[ENCRYPTION_SECRET_KEY]"
    blockstore:
      type: s3
      s3:
        region: us-east-1
    gateways:
      s3:
         # replace this with the host you will use for the lakeFS S3-compatible endpoint:
         domain_name: [S3_GATEWAY_DOMAIN]
    
  2. Download the binary to the EC2 instance.
  3. Run the lakefs binary on the EC2 instance:
    lakefs --config config.yaml run
    

    Note: it is preferable to run the binary as a service using systemd or your operating system’s facilities.
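For example, a minimal systemd unit might look like the sketch below; the unit name, paths, and service user are assumptions, so adjust them to match where you placed the binary and configuration. If you still need a value for [ENCRYPTION_SECRET_KEY], a command such as openssl rand -hex 32 is one way to generate a random string.

# /etc/systemd/system/lakefs.service (example sketch)
[Unit]
Description=lakeFS server
After=network.target

[Service]
# Assumes the binary and config were placed at these paths:
ExecStart=/usr/local/bin/lakefs --config /etc/lakefs/config.yaml run
Restart=on-failure
# Assumes a dedicated, unprivileged user was created for lakeFS:
User=lakefs

[Install]
WantedBy=multi-user.target

Enable and start it with: sudo systemctl enable --now lakefs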

On ECS

To support container-based environments like AWS ECS, lakeFS can be configured using environment variables. Here is a docker run command to demonstrate starting lakeFS using Docker:

docker run \
  --name lakefs \
  -p 8000:8000 \
  -e LAKEFS_DATABASE_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
  -e LAKEFS_BLOCKSTORE_TYPE="s3" \
  -e LAKEFS_GATEWAYS_S3_DOMAIN_NAME="[S3_GATEWAY_DOMAIN]" \
  treeverse/lakefs:latest run

See the reference for a complete list of environment variables.
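On ECS itself, the same environment variables go into a task definition. The following is a minimal sketch only: the family name, CPU/memory sizes, and Fargate settings are assumptions, and IAM roles, logging, and networking configuration are omitted.

{
  "family": "lakefs",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "lakefs",
      "image": "treeverse/lakefs:latest",
      "command": ["run"],
      "portMappings": [{"containerPort": 8000, "protocol": "tcp"}],
      "environment": [
        {"name": "LAKEFS_DATABASE_CONNECTION_STRING", "value": "[DATABASE_CONNECTION_STRING]"},
        {"name": "LAKEFS_AUTH_ENCRYPT_SECRET_KEY", "value": "[ENCRYPTION_SECRET_KEY]"},
        {"name": "LAKEFS_BLOCKSTORE_TYPE", "value": "s3"},
        {"name": "LAKEFS_GATEWAYS_S3_DOMAIN_NAME", "value": "[S3_GATEWAY_DOMAIN]"}
      ]
    }
  ]
}

You can then register it with aws ecs register-task-definition --cli-input-json file://lakefs-task.json and run it as a service behind the load balancer described below.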

On EKS

See Kubernetes Deployment.

Load balancing

Depending on how you chose to install lakeFS, you should set up a load balancer to direct requests to the lakeFS server.
By default, lakeFS operates on port 8000, and exposes a /_health endpoint which you can use for health checks.

Notes for using an AWS Application Load Balancer

  1. Your security groups should allow the load balancer to access the lakeFS server.
  2. Create a target group with a listener for port 8000.
  3. Set up TLS termination using the domain names you wish to use for both endpoints (e.g. s3.lakefs.example.com, *.s3.lakefs.example.com, lakefs.example.com).
  4. Configure the health check to use the exposed /_health URL.
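For example, steps 2 and 4 could be done with the AWS CLI as follows (a sketch; the target group name, VPC ID, and target type are placeholders for your own values):

aws elbv2 create-target-group \
  --name lakefs \
  --protocol HTTP \
  --port 8000 \
  --target-type instance \
  --vpc-id vpc-0123456789abcdef0 \
  --health-check-path /_health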

DNS on AWS Route53

As mentioned above, you should create 3 DNS records for lakeFS:

  1. One record for the lakeFS API: lakefs.example.com
  2. Two records for the S3-compatible API: s3.lakefs.example.com and *.s3.lakefs.example.com.

For an AWS load balancer with Route53 DNS, create a simple record, and choose Alias to Application and Classic Load Balancer with an A record type.

(Screenshot: configuring a simple record in Route53)
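If you prefer the AWS CLI to the console, the same alias record can be created with change-resource-record-sets. A sketch, assuming placeholder hosted zone IDs and load balancer DNS name (repeat for the other two records):

aws route53 change-resource-record-sets \
  --hosted-zone-id YOUR_HOSTED_ZONE_ID \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "lakefs.example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "YOUR_LOAD_BALANCER_HOSTED_ZONE_ID",
          "DNSName": "my-load-balancer-1234567890.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'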

For other DNS providers, refer to the documentation on how to add CNAME records.

Next Steps

Your next step is to prepare your storage. If you already have a storage bucket/container, you are ready to create your first lakeFS repository.

Why do I need the three DNS records?

Multiple DNS records are needed to access the two different lakeFS APIs (covered in more detail in the Architecture section):

  1. The lakeFS OpenAPI: used by the lakectl CLI tool. Exposes git-like operations (branching, diffing, merging etc.).
  2. An S3-compatible API: read and write your data in any tool that can communicate with S3. Examples include: AWS CLI, Boto, Presto and Spark.

lakeFS actually exposes only one API endpoint. For every request, lakeFS checks the Host header. If the header is under the S3 gateway domain, the request is directed to the S3-compatible API.

The third DNS record (*.s3.lakefs.example.com) allows for virtual-host style access. This is a way for AWS clients to specify the bucket name in the Host subdomain.
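For example, using the AWS CLI against the S3 gateway (my-repo and main stand in for a repository and a branch in that repository):

# The client sends a Host header under the S3 gateway domain, so lakeFS serves
# this request through the S3-compatible API. With virtual-host-style addressing
# the host becomes my-repo.s3.lakefs.example.com, which the wildcard record resolves.
aws s3 ls s3://my-repo/main/ --endpoint-url https://s3.lakefs.example.com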