Expected deployment time: 25min
- Creating the Database on GCP SQL
- Installation Options
- Load balancing
- Next Steps
A production-suitable lakeFS installation requires three DNS records pointing at your lakeFS server. A good convention for those, assuming you already own the domain, is:
- s3.lakefs.example.com: this is the S3 Gateway Domain
The second record, the S3 Gateway Domain, needs to be specified in the lakeFS configuration (see the S3_GATEWAY_DOMAIN placeholder below). This allows lakeFS to route requests to the S3-compatible API. For more info, see Why do I need these three DNS records?
lakeFS requires a PostgreSQL database to synchronize actions on your repositories. We will show you how to create a database on Google Cloud SQL, but you can use any PostgreSQL database as long as it’s accessible by your lakeFS installation.
If you already have a database, take note of the connection string and skip to the next step.
- Follow the official Google documentation on how to create a PostgreSQL instance. Make sure you’re using PostgreSQL version >= 11.
- On the Users tab in the console, create a user to be used by the lakeFS installation.
- Choose the method by which lakeFS will connect to your database. Google recommends using the SQL Auth Proxy.
Depending on the chosen lakeFS installation method, you will need to make sure lakeFS can access your database. For example, if you install lakeFS on GKE, you need to deploy the SQL Auth Proxy from this Helm chart, or as a sidecar container in your lakeFS pod.
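The connection string lakeFS expects is a PostgreSQL URI. As a minimal sketch, assuming the SQL Auth Proxy is listening locally on 127.0.0.1:5432 and using hypothetical user, password, and database names, the string could be assembled like this. The user and password are percent-encoded so that characters such as `@` or `/` do not break the URI.

```python
from urllib.parse import quote


def build_connection_string(user, password, host="127.0.0.1", port=5432,
                            dbname="postgres", sslmode=None):
    """Assemble a PostgreSQL URI for the lakeFS database setting.

    The host, port, and dbname defaults are assumptions for a local
    SQL Auth Proxy; replace them with the values of your own instance.
    """
    # Percent-encode every reserved character in the credentials.
    uri = (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(dbname, safe='')}"
    )
    if sslmode:
        # Optional libpq-style parameter, e.g. "disable" or "require".
        uri += f"?sslmode={sslmode}"
    return uri
```

For example, `build_connection_string("lakefs", "p@ss/word")` produces a URI that can be pasted in place of `[DATABASE_CONNECTION_STRING]`.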
Save the following configuration file as
---
database:
  connection_string: "[DATABASE_CONNECTION_STRING]"
auth:
  encrypt:
    # replace this with a randomly-generated string:
    secret_key: "[ENCRYPTION_SECRET_KEY]"
blockstore:
  type: gs
  # Uncomment the following lines to give lakeFS access to your buckets using a service account:
  # gs:
  #   credentials_json: [YOUR SERVICE ACCOUNT JSON STRING]
gateways:
  s3:
    # replace this with the host you will use for the lakeFS S3-compatible endpoint:
    domain_name: [S3_GATEWAY_DOMAIN]
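The configuration asks for a randomly generated string as the encryption secret key. One possible way to produce it is with Python's `secrets` module, as sketched here. The 32-byte length is an assumption for illustration; the configuration above does not state a required length.

```python
import secrets


def generate_secret_key(num_bytes=32):
    """Return a random hex string suitable as an opaque secret.

    num_bytes=32 is an assumed length (64 hex characters), not a value
    mandated by lakeFS.
    """
    # token_hex draws from the operating system's secure random source.
    return secrets.token_hex(num_bytes)
```

Generate the key once, store it somewhere safe, and reuse the same value every time lakeFS is started with this configuration.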
- Download the binary to the GCE instance.
- Run the lakefs binary on the GCE machine:
lakefs --config config.yaml run
Note: it is preferable to run the binary as a service using systemd or your operating system’s facilities.
To support container-based environments like Google Cloud Run, lakeFS can be configured using environment variables. Here is a docker run command to demonstrate starting lakeFS using Docker:
docker run \
  --name lakefs \
  -p 8000:8000 \
  -e LAKEFS_DATABASE_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
  -e LAKEFS_BLOCKSTORE_TYPE="gs" \
  -e LAKEFS_GATEWAYS_S3_DOMAIN_NAME="[S3_GATEWAY_DOMAIN]" \
  treeverse/lakefs:latest run
See the reference for a complete list of environment variables.
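The variable names in the docker run command appear to follow a simple pattern: take the dotted path of a key in the configuration file, replace the dots with underscores, upper-case it, and prefix it with `LAKEFS_`. The helper below is a sketch of that pattern as inferred from the four variables shown on this page; `config_key_to_env_var` is a hypothetical name, and the reference remains the authoritative list.

```python
def config_key_to_env_var(key, prefix="LAKEFS"):
    """Map a dotted config key to the matching environment variable name.

    The rule is inferred from the examples on this page and is not taken
    from the lakeFS source code.
    """
    return prefix + "_" + key.replace(".", "_").upper()


# The four settings used in the docker run command:
for key in (
    "database.connection_string",
    "auth.encrypt.secret_key",
    "blockstore.type",
    "gateways.s3.domain_name",
):
    print(f"{key} -> {config_key_to_env_var(key)}")
```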
Depending on how you chose to install lakeFS, you should have a load balancer direct requests to the lakeFS server.
By default, lakeFS operates on port 8000, and exposes a /_health endpoint which you can use for health checks.
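Before wiring the endpoint into a load balancer, it can be useful to probe it by hand. The following is a small sketch using only the Python standard library. It assumes that a healthy server answers `/_health` with HTTP 200; `health_url` and `is_healthy` are hypothetical helper names.

```python
import urllib.request


def health_url(base_url):
    """Build the health-check URL from the server's base URL."""
    return base_url.rstrip("/") + "/_health"


def is_healthy(base_url, timeout=5.0):
    """Return True if the health endpoint answers with HTTP 200.

    Treating only status 200 as healthy is an assumption of this sketch.
    """
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and non-2xx responses
        # (urllib raises HTTPError, a subclass of OSError, for those).
        return False
```

With lakeFS running locally on the default port, `is_healthy("http://localhost:8000")` would perform the probe.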
As mentioned above, you should create 3 DNS records for lakeFS:
- One record for the lakeFS API:
- Two records for the S3-compatible API:
Depending on your DNS provider, refer to the documentation on how to add CNAME records.
Multiple DNS records are needed to access the two different lakeFS APIs (covered in more detail in the Architecture section):
- The lakeFS OpenAPI: used by the lakectl CLI tool. Exposes git-like operations (branching, diffing, merging etc.).
- An S3-compatible API: read and write your data in any tool that can communicate with S3. Examples include: AWS CLI, Boto, Presto and Spark.
lakeFS actually exposes only one API endpoint. For every request, lakeFS checks the Host header. If the header is under the S3 gateway domain, the request is directed to the S3-compatible API.
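The Host-based dispatch described above can be illustrated with a few lines of Python. This is a sketch of the idea only, not lakeFS's actual routing code: `route` is a hypothetical function, and "under the S3 gateway domain" is interpreted here as the domain itself or any subdomain of it.

```python
def route(host_header, gateway_domain):
    """Decide which API a request goes to, based on its Host header.

    Returns "s3-gateway" when the host equals the gateway domain or is a
    subdomain of it, and "api" otherwise.
    """
    # Drop an optional ":port" suffix and normalize case and trailing dot.
    # (IPv6 literals are not handled in this sketch.)
    host = host_header.split(":", 1)[0].strip().lower().rstrip(".")
    domain = gateway_domain.strip().lower().rstrip(".")
    if host == domain or host.endswith("." + domain):
        return "s3-gateway"
    return "api"
```

With the gateway domain `s3.lakefs.example.com`, a request for `s3.lakefs.example.com:8000` is classified as `"s3-gateway"`, while a request for a different host such as the hypothetical `lakefs.example.com` falls through to `"api"`. Matching on `"." + domain` rather than on the bare suffix keeps a host like `evils3.lakefs.example.com` from being mistaken for a gateway host.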
The third DNS record (*.s3.lakefs.example.com) allows for virtual-host style access. This is a way for AWS clients to specify the bucket name in the Host subdomain.
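To make virtual-host style access concrete, here is a minimal sketch of how a server could recover the bucket name from the Host header. It is illustrative only and is not taken from the lakeFS code base; `bucket_from_host` and the bucket name `example-repo` are hypothetical.

```python
def bucket_from_host(host_header, gateway_domain):
    """Extract the bucket name from a virtual-host style Host header.

    Returns None when the host is not a subdomain of the gateway domain,
    for instance when the client uses the gateway domain directly and
    puts the bucket in the URL path instead.
    """
    # Drop an optional ":port" suffix and normalize case and trailing dot.
    host = host_header.split(":", 1)[0].strip().lower().rstrip(".")
    suffix = "." + gateway_domain.strip().lower().rstrip(".")
    if host.endswith(suffix):
        # Everything in front of ".<gateway domain>" is the bucket name.
        return host[: -len(suffix)] or None
    return None
```

For a request sent to `example-repo.s3.lakefs.example.com`, the function returns `"example-repo"`. For a request sent to `s3.lakefs.example.com` itself, it returns `None`. The wildcard record is what makes every such bucket subdomain resolve to the same lakeFS server.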