# Performance Best Practices
Use this guide to achieve the best performance with lakeFS.
## Avoid concurrent commits/merges
Just like in Git, branch history is composed of commits and is linear by nature. Concurrent commits or merges on the same branch result in a race: the first operation finishes successfully, while the rest retry.
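Since only one of the racing operations wins, clients typically retry the losers. Below is a minimal sketch of such a retry loop; the `ConflictError` class is a hypothetical stand-in for whatever error your lakeFS client raises on a lost race:

```python
import time

class ConflictError(Exception):
    """Hypothetical stand-in for the error a lakeFS client raises
    when a concurrent commit/merge wins the race."""

def commit_with_retry(commit, retries=5, backoff=0.1):
    """Call commit(); on a conflict, back off exponentially and retry."""
    for attempt in range(retries):
        try:
            return commit()
        except ConflictError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the conflict
            time.sleep(backoff * 2 ** attempt)
```

Serializing commits through a single writer (or a queue) avoids the race entirely; a retry loop like this is the fallback when that is impractical.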
## Perform meaningful commits
It’s a good idea to perform commits that are meaningful, in the sense that they represent a logical point in your data’s lifecycle. While lakeFS supports arbitrarily large commits, avoiding commits with a huge number of objects will result in a more comprehensible commit history.
## Use zero-copy import
To import objects into lakeFS, either a single time or on a regular basis, lakeFS offers a zero-copy import feature. Use this feature to import a large number of objects into lakeFS instead of copying them into your repository: it creates references to the existing objects in your bucket and avoids the copy.
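The idea can be pictured with a toy model (a sketch of the concept, not lakeFS's actual implementation): an import records pointers from logical paths to existing physical addresses, so no bytes are copied. The bucket name and object keys below are made up:

```python
# Toy model: a "repository index" maps logical paths to physical addresses.
# A zero-copy import records pointers to existing objects; no data moves.

def zero_copy_import(repo_index, bucket_keys, prefix):
    """Add a reference for every existing bucket object under `prefix`."""
    for key in bucket_keys:
        if key.startswith(prefix):
            repo_index[key] = f"s3://example-bucket/{key}"  # reference only
    return repo_index

index = zero_copy_import({}, ["raw/a.parquet", "raw/b.parquet", "img/c.png"], "raw/")
print(index["raw/a.parquet"])  # s3://example-bucket/raw/a.parquet
```

Because only metadata is written, the cost scales with the number of references recorded, not with the size of the objects.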
## Read data using the commit ID
In cases where you are only interested in reading committed data:

- Use a commit ID (or a tag ID) in your path (e.g. `lakefs://repo/<commit_id>/path`).
- When reading from a branch, add `@` before the path to read only committed data.

When accessing data using the branch name (e.g. `lakefs://repo/main/path`), lakeFS will also try to fetch uncommitted data, which may result in reduced performance.
For more information, see how uncommitted data is managed in lakeFS.
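Concretely, the choice is only about the path you hand to your reader. The repository name, commit ID, and object path below are hypothetical:

```python
REPO = "example-repo"                         # hypothetical repository name
COMMIT_ID = "a1b2c3d4"                        # pin a specific commit (or use a tag)
PATH = "datasets/daily/2024-01-01.parquet"    # hypothetical object path

# Branch-based path: lakeFS may also look at uncommitted data (slower).
branch_uri = f"lakefs://{REPO}/main/{PATH}"

# Commit-based path: only committed, immutable data is read.
commit_uri = f"lakefs://{REPO}/{COMMIT_ID}/{PATH}"

print(commit_uri)  # lakefs://example-repo/a1b2c3d4/datasets/daily/2024-01-01.parquet
```

Pinning a commit ID also makes reads reproducible: the same path always returns the same data.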
## Operate directly on the storage
Storage operations can sometimes become a bottleneck, for example when your data pipelines upload many large objects. In such cases, it can be beneficial to perform only versioning operations on lakeFS, while performing storage reads/writes directly on the object store. lakeFS offers multiple ways to do that:
- The `lakectl upload --direct` command (or download).
- The lakeFS Hadoop Filesystem.
- The staging API which can be used to add lakeFS references to objects after having written them to the storage.
Accessing the object store directly is a faster way to interact with your data.
## Zero-copy

lakeFS branching is zero-copy. Instead of copying the data, you can check out a new branch. Creating a branch takes constant time, since the new branch points to the same data as its parent. This also lowers your storage costs.
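The constant-time claim can be illustrated with a toy model in which a branch is just a named pointer to a commit (a sketch of the idea, not lakeFS's actual data structures; all names below are made up):

```python
# Toy model: branches are pointers to commits; commits point to object data.
commits = {"c1": {"data/part-0001.parquet": "s3://bucket/abc123"}}
branches = {"main": "c1"}

def create_branch(branches, name, source):
    """O(1): the new branch points at the same commit as its parent."""
    branches[name] = branches[source]

create_branch(branches, "experiment", "main")
assert branches["experiment"] == branches["main"]  # same commit, no data copied
```

Because no object data is duplicated, creating a branch is as cheap for a petabyte repository as for an empty one.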