Link Search Menu Expand Document

Sizing guide

Table of contents

  1. System Requirements
    1. Operating Systems and ISA
    2. Memory and CPU requirements
    3. Network
    4. Disk
    5. PostgreSQL database
      1. Storage
      2. RAM and shared_buffers
      3. CPU
  2. Scaling factors
    1. Understanding latency and throughput considerations
  3. Benchmarks
    1. Random reads
    2. Random Writes
    3. Branch creation
  4. Important metrics
  5. Reference architectures
    1. Reference Architecture: Data Science/Research environment
    2. Reference Architecture: Automated Production Pipelines

System Requirements

Operating Systems and ISA

lakeFS can run on MacOS and Linux (Windows binaries are available but not rigorously tested - we don’t recommend deploying lakeFS to production on Windows). x86_64 and arm64 architectures are supported for both MacOS and Linux.

Memory and CPU requirements

lakeFS servers require a minimum of 512mb of RAM and 1 CPU core. For high throughput, additional CPUs help scale requests across different cores. “Expensive” operations such as large diff or commit operations can take advantage of multiple cores.

Network

If using the data APIs such as the S3 Gateway, lakeFS will require enough network bandwidth to support the planned concurrent network upload/download operations. For most cloud providers, more powerful machines (i.e. more expensive and usually with more CPU cores) also provide increased network bandwidth.

If using only the metadata APIs (for example, only using the Hadoop/Spark clients), network bandwidth is minimal, at roughly 1Kb per request.

Disk

lakeFS greatly benefits from fast local disks. A lakeFS instance doesn’t require any strong durability guarantees from the underlying storage, as the disk is only ever used as a local caching layer for lakeFS metadata, and not for long-term storage. lakeFS is designed to work with ephemeral disks - these are usually based on NVMe and are tied to the machine’s lifecycle. Using ephemeral disks lakeFS can provide a very high throughput/cost ratio, probably the best that could be achieved on a public cloud, so we recommend those.

A local cache of at least 512 MiB should be provided. For large installations (managing >100 concurrently active branches, with >100M objects per commit), we recommend allocating at least 10 GiB - since it’s a caching layer over a relatively slow storage (the object store), see Important metrics below to understand how to size this: it should be big enough to hold all commit metadata for actively referenced commits.

PostgreSQL database

lakeFS uses a PostgreSQL instance to manage branch references, authentication and authorization information and to keep track of currently uncommitted data across branches.

Storage

The dataset stored in PostgreSQL is relatively modest, as most metadata is pushed down into the object store. Required storage is mostly a factor of the amount of uncommitted writes across all branches at any given point in time: in the range of 150 MiB per every 100,000 uncommitted writes.

We recommend starting at 10 GiB for a production deployment, as it will likely be more than enough.

RAM and shared_buffers

Since the data size is small, it is recommended to provide enough memory to hold the vast majority of that data in RAM: Ideally configure shared_buffers of your PostgreSQL instances to be large enough to contain the currently active dataset. Pick a database instance with enough RAM to accommodate this buffer size, at roughly x4 the size given for shared_buffers (so for example, if an installation has ~500,000 uncommitted writes at any given time, it would require about 750 MiB of shared_buffers, that would require about 3 GiB of RAM).

Cloud providers will save you the need to tune this parameter. It will be set to a fixed percentage the chosen instance’s available RAM (25% on AWS RDS, 30% on Google Cloud SQL).

CPU

PostgreSQL CPU cores help scale concurrent requests. 1 CPU core for every 5,000 requests/second is ideal.

Scaling factors

Scaling lakeFS, like most data systems, moves across 2 axes: throughput of requests (amount per given timeframe), and latency (time to complete a single request).

Understanding latency and throughput considerations

Most lakeFS operations are designed to be very low in latency. Assuming a well tuned local disk cache (see Storage above), most critical path operations (writing objects, requesting objects, deleting objects) are designed to complete in <25ms at p90. Listing objects obviously requires accessing more data, but should always be on-par with what the underlying object store can provide, and in most cases, it is actually faster. At the worst case for directory listing with 1,000 common prefixes returned, expect a latency of 75ms at p90.

Managing branches (creating them, listing them and deleting them) are all constant-time operations, generally taking <30ms at p90.

Committing and merging can take longer, as they are proportional to the amount of changes introduced. This is what makes lakeFS optimal for large Data Lakes - the amount of changes introduced per commit usually stays relatively stable, while the entire data set usually grows over time. This means lakeFS will provide predictable performance: committing 100 changes will take roughly the same amount of time whether the resulting commit contains 500 or 500 million objects.

See Data Model for more information.

Scaling throughput depends very much on the amount of CPU cores available to lakeFS. In many cases it is easier to scale lakeFS across a fleet of smaller cloud instances (or containers) than it is to scale up with machines that have many cores, and in fact, lakeFS works well in both cases. Most critical path operations scale very well across machines.

Benchmarks

All benchmarks below were measured using 2 x c5ad.4xlarge instances on AWS us-east-1. Similar results can be achieved on Google Cloud using a c2-standard-16 machine type, with an attached local SSD. On Azure, a Standard_F16s_v2 virtual machine can be used.

The PostgreSQL instance that was used is a db.m6g.2xlarge (8 vCPUs, 32 GB RAM). Equivalent machines on Google Cloud or Azure should yield similar results.

The example repository we tested against contains the metadata of a large lakeFS installation, where each commit contains ~180,000,000 objects (representing ~7.5 Petabytes of data).

All tests are reproducible using the lakectl abuse command, so do use it to properly size and tune your setup. All tests are accompanied by the relevant lakectl abuse command that generated them.

Random reads

This test generates random read requests to lakeFS, in a given commit. Paths are requested randomly from a file containing a set of preconfigured (and existing) paths.

command executed:

lakectl abuse random-read \
    --from-file randomly_selected_paths.txt \
    --amount 500000 \
    --parallelism 128 \
    lakefs://example-repo@<commit hash>

Result Histogram (raw):

Histogram (ms):
1	0
2	0
5	37945
7	179727
10	296964
15	399682
25	477502
50	499625
75	499998
100	499998
250	500000
350	500000
500	500000
750	500000
1000	500000
5000	500000
min	3
max	222
total	500000

So 50% of all requests took <10ms, while 99.9% of them took <50ms

throughput:

Average throughput during the experiment was 10851.69 requests/second

Random Writes

This test generates random write requests to a given lakeFS branch. All paths are pre-generated and do not overwrite each other (as overwrites are relatively rare in a Data Lake setup).

command executed:

lakectl abuse random-write \
    --amount 500000 \
    --parallelism 64 \
    lakefs://example-repo@main

Result Histogram (raw):

Histogram (ms):
1	0
2	0
5	30715
7	219647
10	455807
15	498144
25	499535
50	499742
75	499784
100	499802
250	500000
350	500000
500	500000
750	500000
1000	500000
5000	500000
min	3
max	233
total	500000

So 50% of all requests took <10ms, while 99.9% of them took <25ms

throughput:

Average throughput during the experiment was 7595.46 requests/second

Branch creation

This test creates branches from a given reference

command executed:

lakectl abuse create-branches \
    --amount 500000 \
    --branch-prefix "benchmark-" \
    --parallelism 256 \
    lakefs://example-repo@<commit hash>

Result Histogram (raw):

Histogram (ms):
1	0
2	1
5	5901
7	39835
10	135863
15	270201
25	399895
50	484932
75	497180
100	499303
250	499996
350	500000
500	500000
750	500000
1000	500000
5000	500000
min	2
max	304
total	500000

So 50% of all requests took <15ms, while 99.9% of them took <100ms

throughput:

Average throughput during the experiment was 7069.03 requests/second

Important metrics

lakeFS exposes metrics using the Prometheus protocol. Every lakeFS instance exposes a /metrics endpoint that could be used to extract them.

Here are a few notable metrics to keep track of when sizing lakeFS:

api_requests_total - Tracks throughput of API requests over time

api_request_duration_seconds - Histogram of latency per operation type

gateway_request_duration_seconds - Histogram of latency per S3 Gateway operation

go_sql_stats_* - Important client-side metrics collected from the PostgreSQL driver. See The full reference here.

Reference architectures

Below are a few example architectures for lakeFS deployment

Reference Architecture: Data Science/Research environment

Use case: Manage Machine learning or algorithms development. Use lakeFS branches to achieve both isolation and reproducibility of experiments. Data being managed by lakeFS is both structured, tabular data; as well as unstructured sensor and image data used for training. Assuming a team of 20-50 researchers, with a dataset size of 500 TiB across 20M objects.

Environment: lakeFS will be deployed on Kubernetes managed by AWS EKS with PostgreSQL on AWS RDS Aurora

Sizing: Since most of the work is done by humans (vs automated pipelines), most experiments tend to be small in scale, reading and writing 10s to 1000s of objects. The expected amount of branches active in parallel is relatively low, around 1-2 per user, each representing a small amount of uncommitted changes at any given point in time. Let’s assume 5,000 uncommitted writes per branch = ~500k.

To support the expected throughput, a single moderate lakeFS instance should be more than enough, since requests per second would be on the order of 10s to 100s. For high availability, we’ll deploy 2 pods with 1 CPU core and 1 GiB of RAM each..

Since the PostgreSQL instance is expected to hold a very small dataset (at 500k, expected dataset size is 150MiB (for 100k records) * 5 = 750MiB). To ensure we have enough RAM to hold this, we’ll need 3 GiB of RAM, so a very moderate Aurora instance db.t3.large (2 vCPUs, 8 GB RAM) will be more than enough. An equivalent database instance on GCP or Azure should give similar results.

ML and Research lakeFS reference architecture

Reference Architecture: Automated Production Pipelines

Use case: Manage multiple concurrent data pipelines using Apache Spark and Airflow. Airflow DAGs start by creating a branch for isolation and for CI/CD. Data being managed by lakeFS is structured, tabular data. Total dataset size is 10 PiB, spanning across 500M objects. Expected throughput is 10k reads/second + 2k writes per second across 100 concurrent branches.

Environment: lakeFS will be deployed on Kubernetes managed by AWS EKS with PostgreSQL on AWS RDS

Sizing: Data pipelines tend to be bursty in nature: reading in a lot of objects concurrently, doing some calculation or aggregation and then writing many objects concurrently. The expected amount of branches active in parallel is high, with many Airflow DAGs running per day, each representing a moderate amount of uncommitted changes at any given point in time. Let’s assume 1,000 uncommitted writes/branch * 2500 branches = ~2.5M records.

To support the expected throughput, looking the benchmarking numbers above, we’re doing roughly 625 requests/core, so 24 cores should cover our peak traffic. We can deploy 6 * 4 CPU core pods.

The PostgreSQL instance (at 500k, expected dataset size is 150MiB (for 100k records) * 25 = 3750 MiB). To ensure we have enough RAM to hold this, we’ll need at least 15 GiB of RAM, so we’ll go with a db.r5.xlarge (4 vCPUs, 32GB RAM) Aurora instance. An equivalent database instance on GCP or Azure should give similar results.

Automated pipelines lakeFS reference architecture