lakeFS blends concepts from object stores such as S3 with concepts from Git. This reference defines the common concepts of lakeFS. Each concept is defined at the point where its name appears in italic text.
lakeFS is an object store, and borrows concepts from S3.
An object store links objects to paths. An object holds:
- Some contents, with unlimited size and format.
- Some metadata, including:
  - size in bytes
  - the creation time, a timestamp with seconds resolution
  - a checksum string which uniquely identifies the contents
  - some user metadata, a small map of strings to strings.
Similarly to many object stores, lakeFS objects are immutable and never rewritten. They can be entirely replaced or deleted, but not modified.
A path is a readable string, typically decoded as UTF-8. lakeFS maps paths to their objects according to specific rules. lakeFS paths use the lakefs protocol, described below.
lakeFS borrows its concepts for version control from Git.
A repository is a collection of objects with common history tracking. lakeFS manages versions of the repository, identified by their commits. A commit is a collection of object metadata and data, including especially all paths and the object contents and metadata at that commit. Commits have their own commit metadata, which includes a textual comment and additional user metadata.
Commits are organized into a history using their parent commits. Every repository has exactly one initial commit with no parents. Note that a Git repository may have multiple initial commits. A commit with more than one parent is a merge commit. Currently lakeFS only supports merge commits with two parents.
A commit is identified by its commit ID, a digest of all contents of the commit. Commit IDs are by nature long, so a unique prefix may be used to abbreviate them (but note that short prefixes can become non-unique as the repository grows, so prefer to avoid abbreviating when storing commits for the long term). A commit may also be identified by using a textual definition, called a ref. Examples of refs include tags, branches, and expressions. The state of the repository at any commit is always readable.
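The caveat about abbreviated commit IDs can be illustrated with a minimal sketch. The function name and the shortened commit IDs below are hypothetical and are not part of lakeFS; the point is only that a prefix is usable while exactly one commit ID starts with it.

```python
def resolve_commit_prefix(prefix, commit_ids):
    """Return the single commit ID that starts with prefix, or raise."""
    matches = [cid for cid in commit_ids if cid.startswith(prefix)]
    if not matches:
        raise KeyError(f"no commit matches prefix {prefix!r}")
    if len(matches) > 1:
        # A prefix that was once unique can become ambiguous as commits accumulate.
        raise ValueError(f"prefix {prefix!r} is ambiguous ({len(matches)} matches)")
    return matches[0]


# Hypothetical commit IDs, shortened for readability.
commits = ["a1b2c3d4", "a1f09e77", "9c0ffee1"]
print(resolve_commit_prefix("9c", commits))
```

Here the prefix a1 would be rejected as ambiguous, which is why full commit IDs are the safer choice for long-term storage.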
A tag is an immutable pointer to a single commit. Tags have readable names. Because every tag points to a commit, a repository can be read from any tag. Example tags:
- v2.3 to mark a release
- dev:jane-before-v2.3-merge to mark Jane’s private temporary point.
A branch is a mutable pointer to a commit and its staging area. Repositories are readable from any branch, but they are also writable to a branch. The staging area associated with a branch is mutable storage where objects can be created, updated or deleted. These objects are readable when reading from the branch. To create a commit from a branch, all files from the staging area are merged into the contents of the current branch, creating the new set of objects. The parent of the commit is the previous branch tip, and the new branch tip is set to this new commit. Example branches:
- main, the trunk
- staging, maybe ahead of main
- dev:joe-bugfix-1234 for Joe to fix issue 1234.
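The relationship between a branch, its staging area, and a new commit can be sketched as a toy in-memory model. The class names, the use of plain dictionaries, and the None marker for a staged deletion are assumptions of this sketch; they do not describe how lakeFS stores anything.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

DELETED = None  # staging marker for a deleted path (an assumption of this sketch)


@dataclass
class Commit:
    objects: Dict[str, str]            # path -> object contents
    parent: Optional["Commit"] = None  # previous branch tip, if any
    message: str = ""


@dataclass
class Branch:
    tip: Commit
    staging: Dict[str, Optional[str]] = field(default_factory=dict)

    def read(self, path):
        # Staged changes are visible when reading from the branch.
        if path in self.staging:
            return self.staging[path]
        return self.tip.objects.get(path)

    def commit(self, message):
        # Merge the staging area into the contents of the current tip.
        merged = dict(self.tip.objects)
        for path, value in self.staging.items():
            if value is DELETED:
                merged.pop(path, None)
            else:
                merged[path] = value
        # The old tip becomes the parent; the branch now points at the new commit.
        self.tip = Commit(objects=merged, parent=self.tip, message=message)
        self.staging = {}
        return self.tip
```

A short usage example: after staging a change to one path and committing, the previous tip is untouched and becomes the parent of the new tip.

```python
initial = Commit(objects={"a.csv": "v1"})
main = Branch(tip=initial)
main.staging["a.csv"] = "v2"
new_tip = main.commit("update a.csv")
print(new_tip.parent is initial, initial.objects, new_tip.objects)
```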
lakeFS also supports expressions for creating a ref. These are similar to revisions in Git; indeed, all the examples at the end of the Git documentation's section on revisions work unchanged in lakeFS.
- A branch or a tag is a ref expression.
- If <ref> is a ref expression, then:
  - <ref>^ is a ref expression referring to its first parent.
  - <ref>^N is a ref expression referring to its N’th parent; in particular <ref>^1 is the same as <ref>^.
  - <ref>~ is a ref expression referring to its first parent; in particular <ref>~ is the same as <ref>^.
  - <ref>~N is a ref expression referring to its N’th ancestor, always traversing to the first parent. So <ref>~N is the same as <ref>^^...^ with N consecutive carets.
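The rules above can be sketched as a small resolver over an in-memory commit graph. This is an illustration under stated assumptions, not the lakeFS implementation: the function name, the refs and parents dictionaries, and the commit names (c1 to c4, f1, f2) are all hypothetical, and forms such as ^0 are deliberately left out.

```python
import re


def resolve(expr, refs, parents):
    """Resolve a ref expression such as 'main~2' or 'main^2' to a commit ID.

    refs:    branch or tag name -> commit ID
    parents: commit ID -> list of parent commit IDs, first parent first
    """
    # Split the expression into a base name and a trailing run of ^ and ~ operators.
    match = re.match(r"^(.*?)((?:[\^~]\d*)*)$", expr)
    name, suffix = match.group(1), match.group(2)
    commit = refs[name]
    for op, num in re.findall(r"([\^~])(\d*)", suffix):
        n = int(num) if num else 1
        if op == "^":
            if n < 1:
                raise ValueError("^0 is not handled in this sketch")
            commit = parents[commit][n - 1]  # the N'th parent
        else:
            for _ in range(n):               # N steps, always via the first parent
                commit = parents[commit][0]
    return commit


# Hypothetical graph: c4 is a merge commit whose parents are c3 and f2.
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "f1": ["c1"],
    "f2": ["f1"],
    "c4": ["c3", "f2"],
}
refs = {"main": "c4"}

print(resolve("main^2", refs, parents))  # second parent of the merge commit
print(resolve("main~2", refs, parents))  # two steps back along first parents
```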
The history of the branch is the list of commits from the branch tip through the first parent of each commit. Histories go back in time.
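As a minimal sketch of this definition, the history can be computed by repeatedly following the first parent from the branch tip. The function name, the parents dictionary, and the commit names are hypothetical.

```python
def branch_history(tip, parents):
    """List commits from the branch tip backwards, following first parents only."""
    history = []
    commit = tip
    while commit is not None:
        history.append(commit)
        commit_parents = parents.get(commit, [])
        # Second parents (merge sources) are not part of the branch history.
        commit = commit_parents[0] if commit_parents else None
    return history


# Hypothetical graph: "m" is a merge commit with parents "b" (first) and "x" (second).
graph = {"a": [], "b": ["a"], "x": ["a"], "m": ["b", "x"]}
print(branch_history("m", graph))
```

In this example the commit x, which arrived through a merge, does not appear in the history of the branch.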
The other way to create a commit is to merge an existing commit onto a branch. To merge a source commit into a branch, lakeFS finds the best common ancestor of that source commit and the branch tip, called the “base”. Then it performs a 3-way merge. The “best” ancestor is exactly that defined in the documentation for git-merge-base. The result of a merge is a new commit, with the destination as the first parent and the source as the second. Thus the previous tip of the merge destination is part of the history of the merged object.
To merge a merge source (a commit) into a merge destination (another commit), lakeFS first finds the merge base, the nearest common ancestor of the two commits. It can then perform a three-way merge by examining the presence and identity of files in each commit. In the table below, “A”, “B” and “C” are possible file contents, “X” is a missing file, and “conflict” (which only appears as a result) is a merge failure.
| In base | In source | In destination | Result | Comment |
|---|---|---|---|---|
| A | B | B | B | Files changed on both sides in same way |
| A | B | C | conflict | Files changed on both sides differently |
| A | A | B | B | File changed only on one branch |
| A | B | A | B | File changed only on one branch |
| A | X | X | X | Files deleted on both sides |
| A | B | X | conflict | File changed on one side, deleted on the other |
| A | X | B | conflict | File changed on one side, deleted on the other |
| A | A | X | X | File deleted on one side |
| A | X | A | X | File deleted on one side |
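The table can be condensed into three comparisons per file. The sketch below is one way to express it, assuming file contents are compared by identity (for example by checksum) and a missing file is represented by None; the names merge_file and MergeConflict are hypothetical.

```python
X = None  # a missing file


class MergeConflict(Exception):
    """Raised when a file changed differently on both sides."""


def merge_file(base, source, dest):
    """Three-way merge of one file, following the table above."""
    if source == dest:
        return source  # both sides agree (same change, or deleted on both sides)
    if source == base:
        return dest    # only the destination changed or deleted the file
    if dest == base:
        return source  # only the source changed or deleted the file
    raise MergeConflict(f"base={base!r} source={source!r} dest={dest!r}")


print(merge_file("A", "A", "B"))  # changed only in the destination
print(merge_file("A", X, "A"))    # deleted only in the source
```

The same three comparisons also give a sensible answer when the file is missing in the base, a case the table does not list.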
The API and lakectl allow passing an optional strategy flag with the following values:
- dest-wins - in case of a conflict, merge will pick the destination object.
- source-wins - in case of a conflict, merge will pick the source object.
If the strategy is set, it affects all the objects in the merge; there is currently no way to treat each conflict differently.
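The all-or-nothing behaviour of the strategy can be sketched as follows. Only the two strategy values come from the text above; the function name and the shape of the conflicts dictionary are assumptions made for illustration.

```python
def apply_strategy(conflicts, strategy=None):
    """Resolve every conflict the same way, or fail if no strategy is given.

    conflicts: path -> (source_contents, dest_contents); None means a deleted file.
    """
    if strategy not in (None, "source-wins", "dest-wins"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if not conflicts:
        return {}
    if strategy is None:
        raise RuntimeError(f"merge failed, conflicting paths: {sorted(conflicts)}")
    # One choice is applied to all conflicting paths; there is no per-path override.
    pick = 0 if strategy == "source-wins" else 1
    return {path: sides[pick] for path, sides in conflicts.items()}


conflicts = {"data/a.csv": ("B", "C"), "data/b.csv": ("B", None)}
print(apply_strategy(conflicts, "source-wins"))
```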
As a format-agnostic system, lakeFS currently merges by complete files. Format-specific and other user-defined merge strategies for handling conflicts are on the roadmap.
Underlying storage is the area on some other object store that lakeFS uses to store object contents and some of its metadata. We sometimes refer to underlying storage as physical. The path used to store the contents of an object is then termed a physical path. The object itself on underlying storage is never modified, except to remove it entirely during some cleanups.
When creating a lakeFS repository, you assign it a storage namespace. The repository’s storage namespace is the prefix in the underlying storage where data for this repository will be stored.
A lot of what lakeFS does is to manage how lakeFS paths translate to physical paths on the object store. This mapping is generally not straightforward. Importantly (and unlike many object stores), lakeFS may map multiple paths to the same object on backing storage, and always does this for objects that are unchanged across versions.
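A toy example may help picture this mapping. The storage namespace and physical object names below are invented for illustration and do not reflect the layout lakeFS actually uses; the sketch only shows that an object unchanged between two commits maps to the same physical path in both.

```python
# Hypothetical storage namespace and physical object names, for illustration only.
NAMESPACE = "s3://example-bucket/example-repo"

commit_1 = {
    "data/a.csv": f"{NAMESPACE}/object-0001",
    "data/b.csv": f"{NAMESPACE}/object-0002",
}

# The next commit replaces a.csv: a new physical object is written,
# while b.csv keeps pointing at the object written earlier.
commit_2 = dict(commit_1)
commit_2["data/a.csv"] = f"{NAMESPACE}/object-0003"


def shared_physical_paths(older, newer):
    """Return the lakeFS paths whose physical path is identical in both commits."""
    return sorted(path for path in older if newer.get(path) == older[path])


print(shared_physical_paths(commit_1, commit_2))
```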
lakeFS uses a specific format for path URIs. The URI lakefs://<REPO>/<REF>/<KEY> is a path to objects in the given repo and ref expression under key. This is used both for path prefixes and for full paths. In similar fashion, lakefs://<REPO>/<REF> identifies the repository at a ref expression, and lakefs://<REPO> identifies a repo.
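The three URI forms can be told apart with a few lines of string handling. The sketch below is a hypothetical helper, not part of any lakeFS client; it splits by hand rather than with a general URL parser so that ref expressions containing ^ or ~ pass through unchanged.

```python
from typing import NamedTuple, Optional


class LakeFSURI(NamedTuple):
    repo: str
    ref: Optional[str]  # None when the URI identifies only a repo
    key: Optional[str]  # None when the URI identifies a repo or a repo at a ref


def parse_lakefs_uri(uri):
    """Split lakefs://<REPO>[/<REF>[/<KEY>]] into its parts."""
    scheme = "lakefs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a lakefs URI: {uri!r}")
    # Split at most twice so that slashes inside the key are preserved.
    parts = uri[len(scheme):].split("/", 2)
    repo = parts[0]
    if not repo:
        raise ValueError(f"missing repository in {uri!r}")
    ref = parts[1] if len(parts) > 1 and parts[1] else None
    key = parts[2] if len(parts) > 2 and ref is not None else None
    return LakeFSURI(repo, ref, key)


print(parse_lakefs_uri("lakefs://example-repo/main~1/data/a.csv"))
print(parse_lakefs_uri("lakefs://example-repo"))
```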