- System Requirements
- Scaling factors
- Important metrics
- Reference architectures
lakeFS can run on MacOS and Linux (Windows binaries are available but not rigorously tested - we don’t recommend deploying lakeFS to production on Windows). x86_64 and arm64 architectures are supported for both MacOS and Linux.
lakeFS servers require a minimum of 512mb of RAM and 1 CPU core. For high throughput, additional CPUs help scale requests across different cores. “Expensive” operations such as large diff or commit operations can take advantage of multiple cores.
If using the data APIs such as the S3 Gateway, lakeFS will require enough network bandwidth to support the planned concurrent network upload/download operations. For most cloud providers, more powerful machines (i.e. more expensive and usually with more CPU cores) also provide increased network bandwidth.
If using only the metadata APIs (for example, only using the Hadoop/Spark clients), network bandwidth is minimal, at roughly 1Kb per request.
lakeFS greatly benefits from fast local disks. A lakeFS instance doesn’t require any strong durability guarantees from the underlying storage, as the disk is only ever used as a local caching layer for lakeFS metadata, and not for long-term storage. lakeFS is designed to work with ephemeral disks - these are usually based on NVMe and are tied to the machine’s lifecycle. Using ephemeral disks lakeFS can provide a very high throughput/cost ratio, probably the best that could be achieved on a public cloud, so we recommend those.
A local cache of at least 512 MiB should be provided. For large installations (managing >100 concurrently active branches, with >100M objects per commit), we recommend allocating at least 10 GiB - since it’s a caching layer over a relatively slow storage (the object store), see Important metrics below to understand how to size this: it should be big enough to hold all commit metadata for actively referenced commits.
lakeFS uses a PostgreSQL instance to manage branch references, authentication and authorization information and to keep track of currently uncommitted data across branches.
The dataset stored in PostgreSQL is relatively modest, as most metadata is pushed down into the object store. Required storage is mostly a factor of the amount of uncommitted writes across all branches at any given point in time: in the range of 150 MiB per every 100,000 uncommitted writes.
We recommend starting at 10 GiB for a production deployment, as it will likely be more than enough.
Since the data size is small, it is recommended to provide enough memory to hold the vast majority of that data in RAM: Ideally configure shared_buffers of your PostgreSQL instances to be large enough to contain the currently active dataset. Pick a database instance with enough RAM to accommodate this buffer size, at roughly x4 the size given for
shared_buffers (so for example, if an installation has ~500,000 uncommitted writes at any given time, it would require about 750 MiB of
shared_buffers, that would require about 3 GiB of RAM).
Cloud providers will save you the need to tune this parameter. It will be set to a fixed percentage the chosen instance’s available RAM (25% on AWS RDS, 30% on Google Cloud SQL).
PostgreSQL CPU cores help scale concurrent requests. 1 CPU core for every 5,000 requests/second is ideal.
Scaling lakeFS, like most data systems, moves across 2 axes: throughput of requests (amount per given timeframe), and latency (time to complete a single request).
Most lakeFS operations are designed to be very low in latency. Assuming a well tuned local disk cache (see Storage above), most critical path operations (writing objects, requesting objects, deleting objects) are designed to complete in <25ms at p90. Listing objects obviously requires accessing more data, but should always be on-par with what the underlying object store can provide, and in most cases, it is actually faster. At the worst case for directory listing with 1,000 common prefixes returned, expect a latency of 75ms at p90.
Managing branches (creating them, listing them and deleting them) are all constant-time operations, generally taking <30ms at p90.
Committing and merging can take longer, as they are proportional to the amount of changes introduced. This is what makes lakeFS optimal for large Data Lakes - the amount of changes introduced per commit usually stays relatively stable, while the entire data set usually grows over time. This means lakeFS will provide predictable performance: committing 100 changes will take roughly the same amount of time whether the resulting commit contains 500 or 500 million objects.
See Data Model for more information.
Scaling throughput depends very much on the amount of CPU cores available to lakeFS. In many cases it is easier to scale lakeFS across a fleet of smaller cloud instances (or containers) than it is to scale up with machines that have many cores, and in fact, lakeFS works well in both cases. Most critical path operations scale very well across machines.
All benchmarks below were measured using 2 x c5ad.4xlarge instances on AWS us-east-1. Similar results can be achieved on Google Cloud using a
c2-standard-16 machine type, with an attached local SSD. On Azure, a
Standard_F16s_v2 virtual machine can be used.
The PostgreSQL instance that was used is a db.m6g.2xlarge (8 vCPUs, 32 GB RAM). Equivalent machines on Google Cloud or Azure should yield similar results.
The example repository we tested against contains the metadata of a large lakeFS installation, where each commit contains ~180,000,000 objects (representing ~7.5 Petabytes of data).
All tests are reproducible using the lakectl abuse command, so do use it to properly size and tune your setup. All tests are accompanied by the relevant
lakectl abuse command that generated them.
This test generates random read requests to lakeFS, in a given commit. Paths are requested randomly from a file containing a set of preconfigured (and existing) paths.
lakectl abuse random-read \ --from-file randomly_selected_paths.txt \ --amount 500000 \ --parallelism 128 \ lakefs://example-repo@<commit hash>
Result Histogram (raw):
Histogram (ms): 1 0 2 0 5 37945 7 179727 10 296964 15 399682 25 477502 50 499625 75 499998 100 499998 250 500000 350 500000 500 500000 750 500000 1000 500000 5000 500000 min 3 max 222 total 500000
So 50% of all requests took <10ms, while 99.9% of them took <50ms
Average throughput during the experiment was 10851.69 requests/second
This test generates random write requests to a given lakeFS branch. All paths are pre-generated and do not overwrite each other (as overwrites are relatively rare in a Data Lake setup).
lakectl abuse random-write \ --amount 500000 \ --parallelism 64 \ lakefs://example-repo@main
Result Histogram (raw):
Histogram (ms): 1 0 2 0 5 30715 7 219647 10 455807 15 498144 25 499535 50 499742 75 499784 100 499802 250 500000 350 500000 500 500000 750 500000 1000 500000 5000 500000 min 3 max 233 total 500000
So 50% of all requests took <10ms, while 99.9% of them took <25ms
Average throughput during the experiment was 7595.46 requests/second
This test creates branches from a given reference
lakectl abuse create-branches \ --amount 500000 \ --branch-prefix "benchmark-" \ --parallelism 256 \ lakefs://example-repo@<commit hash>
Result Histogram (raw):
Histogram (ms): 1 0 2 1 5 5901 7 39835 10 135863 15 270201 25 399895 50 484932 75 497180 100 499303 250 499996 350 500000 500 500000 750 500000 1000 500000 5000 500000 min 2 max 304 total 500000
So 50% of all requests took <15ms, while 99.9% of them took <100ms
Average throughput during the experiment was 7069.03 requests/second
lakeFS exposes metrics using the Prometheus protocol. Every lakeFS instance exposes a
/metrics endpoint that could be used to extract them.
Here are a few notable metrics to keep track of when sizing lakeFS:
api_requests_total - Tracks throughput of API requests over time
api_request_duration_seconds - Histogram of latency per operation type
gateway_request_duration_seconds - Histogram of latency per S3 Gateway operation
go_sql_stats_* - Important client-side metrics collected from the PostgreSQL driver. See The full reference here.
Below are a few example architectures for lakeFS deployment
Use case: Manage Machine learning or algorithms development. Use lakeFS branches to achieve both isolation and reproducibility of experiments. Data being managed by lakeFS is both structured, tabular data; as well as unstructured sensor and image data used for training. Assuming a team of 20-50 researchers, with a dataset size of 500 TiB across 20M objects.
Sizing: Since most of the work is done by humans (vs automated pipelines), most experiments tend to be small in scale, reading and writing 10s to 1000s of objects. The expected amount of branches active in parallel is relatively low, around 1-2 per user, each representing a small amount of uncommitted changes at any given point in time. Let’s assume 5,000 uncommitted writes per branch = ~500k.
To support the expected throughput, a single moderate lakeFS instance should be more than enough, since requests per second would be on the order of 10s to 100s. For high availability, we’ll deploy 2 pods with 1 CPU core and 1 GiB of RAM each..
Since the PostgreSQL instance is expected to hold a very small dataset (at 500k, expected dataset size is
150MiB (for 100k records) * 5 = 750MiB). To ensure we have enough RAM to hold this, we’ll need 3 GiB of RAM, so a very moderate Aurora instance
db.t3.large (2 vCPUs, 8 GB RAM) will be more than enough. An equivalent database instance on GCP or Azure should give similar results.
Use case: Manage multiple concurrent data pipelines using Apache Spark and Airflow. Airflow DAGs start by creating a branch for isolation and for CI/CD. Data being managed by lakeFS is structured, tabular data. Total dataset size is 10 PiB, spanning across 500M objects. Expected throughput is 10k reads/second + 2k writes per second across 100 concurrent branches.
Sizing: Data pipelines tend to be bursty in nature: reading in a lot of objects concurrently, doing some calculation or aggregation and then writing many objects concurrently. The expected amount of branches active in parallel is high, with many Airflow DAGs running per day, each representing a moderate amount of uncommitted changes at any given point in time. Let’s assume 1,000 uncommitted writes/branch * 2500 branches = ~2.5M records.
To support the expected throughput, looking the benchmarking numbers above, we’re doing roughly 625 requests/core, so 24 cores should cover our peak traffic. We can deploy
6 * 4 CPU core pods.
The PostgreSQL instance (at 500k, expected dataset size is
150MiB (for 100k records) * 25 = 3750 MiB). To ensure we have enough RAM to hold this, we’ll need at least 15 GiB of RAM, so we’ll go with a
db.r5.xlarge (4 vCPUs, 32GB RAM) Aurora instance. An equivalent database instance on GCP or Azure should give similar results.