Use this guide to achieve the best performance with lakeFS.
If you’re experiencing slow commits, consider making smaller ones. The more objects a commit contains, the longer it takes to complete. Moreover, a commit should represent a meaningful point in your data’s lifecycle, so limiting commit size also produces a more comprehensible commit history. As a rule of thumb, keep your commits smaller than 1M objects.
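One way to stay under a commit-size ceiling is to batch your changed objects before committing. The sketch below is illustrative only: the helper name and the path list are not part of any lakeFS API, and the threshold simply mirrors the 1M-object rule of thumb above.

```python
# Sketch: split a large backlog of changed objects into batches, each
# small enough for a single commit. Names here are illustrative, not
# part of the lakeFS API.
MAX_OBJECTS_PER_COMMIT = 1_000_000

def plan_commits(changed_paths, limit=MAX_OBJECTS_PER_COMMIT):
    """Yield batches of paths, each suitable for one commit."""
    for start in range(0, len(changed_paths), limit):
        yield changed_paths[start:start + limit]

# Example with a small limit: 25 changed objects become three commits.
paths = [f"data/part-{i}" for i in range(25)]
batches = list(plan_commits(paths, limit=10))
print(len(batches))  # → 3
```

Each batch would then be uploaded and committed in turn, keeping every individual commit fast.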
Just like in Git, branch history is composed of commits and is linear by nature. Concurrent commits/merges on the same branch result in a race: the first operation finishes successfully while the rest retry.
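Since losing commits are retried, client code that commits to a shared branch benefits from a bounded retry loop. This is a generic sketch, assuming your client surfaces a conflict as an exception; `CommitConflict` and `do_commit` are placeholders, not lakeFS API names.

```python
import random
import time

class CommitConflict(Exception):
    """Placeholder for 'another commit won the race on this branch'."""

def commit_with_retry(do_commit, attempts=5, base_delay=0.1):
    # Retry a commit that may lose a race against a concurrent
    # commit/merge on the same branch.
    for attempt in range(attempts):
        try:
            return do_commit()
        except CommitConflict:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Example with a stub that fails twice, then succeeds.
calls = {"n": 0}
def stub_commit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise CommitConflict()
    return "commit-id"

print(commit_with_retry(stub_commit, base_delay=0.001))  # → commit-id
```

Serializing writers onto separate branches and merging them, rather than having them race on one branch, avoids most of these retries in the first place.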
To import objects into lakeFS, either once or on a regular basis, lakeFS offers a zero-copy import feature. Use it to bring a large number of objects into lakeFS instead of copying them into your repository: the feature creates references to the existing objects in your bucket and avoids the copy.
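A zero-copy import is typically triggered with the `lakectl import` command. The snippet below only assembles such an invocation; the `--from`/`--to` flag names are assumed from the lakectl CLI and should be verified against your installed version, and the bucket and repository names are placeholders.

```python
import shlex

def zero_copy_import_cmd(source_uri, dest_uri):
    # Build a `lakectl import` invocation (flag names assumed from the
    # lakectl CLI; check `lakectl import --help` on your version).
    return f"lakectl import --from {shlex.quote(source_uri)} --to {shlex.quote(dest_uri)}"

print(zero_copy_import_cmd("s3://my-bucket/datasets/", "lakefs://repo/main/datasets/"))
# → lakectl import --from s3://my-bucket/datasets/ --to lakefs://repo/main/datasets/
```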
In cases where you are only interested in reading committed data:
- Use a commit ID (or a tag ID) in your path (e.g. lakefs://repo/<commit_id>/path).
- Add `@` before the path (e.g. lakefs://repo/main@/path).

When accessing data using the branch name (e.g. lakefs://repo/main/path), lakeFS will also try to fetch uncommitted data, which may result in reduced performance.

For more information, see how uncommitted data is managed in lakeFS.
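The two committed-only addressing forms above can be sketched as simple path builders. The repo, ref, and path values are placeholders; only the URI shapes matter here.

```python
def pinned_path(repo, ref, path):
    # Read-only-committed data by pinning a commit ID or tag
    # instead of a branch name.
    return f"lakefs://{repo}/{ref}/{path}"

def committed_only_path(repo, branch, path):
    # The `@` suffix after the branch name restricts reads to
    # committed data on that branch.
    return f"lakefs://{repo}/{branch}@/{path}"

print(pinned_path("repo", "a1b2c3d4", "path/to/object"))
# → lakefs://repo/a1b2c3d4/path/to/object
print(committed_only_path("repo", "main", "path/to/object"))
# → lakefs://repo/main@/path/to/object
```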
Sometimes, storage operations can become a bottleneck. For example, when your data pipelines upload many big objects. In such cases, it can be beneficial to perform only versioning operations on lakeFS, while performing storage reads/writes directly on the object store. lakeFS offers multiple ways to do that:
- The `lakectl upload --direct` command (or `download`).
- The lakeFS Hadoop Filesystem.
- The staging API which can be used to add lakeFS references to objects after having written them to the storage.
Accessing the object store directly is a faster way to interact with your data.
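The third option, the staging API, follows a write-then-register pattern: put the bytes on the object store yourself, then link the resulting physical address to a lakeFS path. The sketch below stubs both sides with plain dictionaries; `put_object_direct` and `stage_object` stand in for your object-store client and the lakeFS staging call, and are not real API names.

```python
# Toy model of direct storage writes plus staging-API registration.
def put_object_direct(store, physical_key, data):
    # In real code: a direct PUT to the object store (e.g. S3).
    store[physical_key] = data
    return physical_key

def stage_object(index, logical_path, physical_key, size):
    # In real code: a staging-API call that adds a lakeFS reference
    # to an object already written to the storage.
    index[logical_path] = {"physical": physical_key, "size": size}

store, index = {}, {}
data = b"payload"
key = put_object_direct(store, "bucket/prefix/obj-0001", data)
stage_object(index, "main/path/obj", key, len(data))
print(index["main/path/obj"]["physical"])  # → bucket/prefix/obj-0001
```

The payload never passes through lakeFS; only the small metadata record does, which is why this pattern helps when objects are large.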
lakeFS provides a zero-copy branching mechanism. Instead of copying the data, check out a new branch: creating a branch takes constant time because the new branch points to the same data as its parent. It also lowers your storage cost.
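Conceptually, a branch is just a named pointer to a commit, which is why creating one is constant-time regardless of data size. A toy model (not the lakeFS implementation) of that idea:

```python
# Toy model of zero-copy branching: creating a branch duplicates a
# commit pointer, never the underlying data.
branches = {"main": "commit-1234"}

def create_branch(branches, new_branch, source_branch):
    # Constant-time: only the commit reference is copied.
    branches[new_branch] = branches[source_branch]

create_branch(branches, "experiment", "main")
print(branches["experiment"] == branches["main"])  # → True
```

From that point on, the two branches diverge only by the new objects written to each, which is what keeps storage costs low.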