Link Search Menu Expand Document


Table of contents

  1. Architecture
    1. lakeFS On The Rocks: Milestone #1 - Committed Metadata Extraction
    2. lakeFS On The Rocks: Milestone #2 - Metadata Service
    3. lakeFS On The Rocks: Milestone #3 - Remove PostgreSQL
  2. Operations
    1. Kubernetes operator for gateways and Metadata servers
    2. Azure Data Lake Storage Support
    3. Metadata operations security and access model
  3. Clients
    1. Java native SDK for metadata server
    2. Python SDK for metadata server
  4. Use Case: Development Environment
    1. Ephemeral branches with a TTL
  5. Use Case: Continuous Integration
    1. Repo linking
    2. Webhook Support
    3. Protected Branches
    4. Webhook Support integration: Metastore registration
    5. Webhook Support integration: Metadata validation
  6. Use Case: Continuous Deployment
    1. Airflow Operators
    2. Webhook Support integration: Data Quality testing
    3. Webhook Alerting


TL;DR - After receiving feedback on early versions of lakeFS, project “lakeFS on the Rocks” represents a set of changes to the architecture and data model of lakeFS. The main motivators are simplicity, reduced barriers of entry, scalability - and the added benefit of having lakeFS adhere more closely to Git in semantics and UX.

There are 3 big shifts in design:

  1. Using the underlying object store as a source of truth for all committed data. We do this by storing commits as RocksDB SSTable files, where each commit is a “snapshot” of a given repository, split across multiple SSTable files that could be reused across commits.
  2. Removal of PostgreSQL as a dependency: scaling it for very high throughput while keeping it predictable in performance for different loads and access patterns has a very high operational cost.
  3. Extract metadata operations into a separate service (with the S3 API Gateway remaining as a stateless client for this service). Would allow for the development of “native” clients for big data tools that don’t require passing the data itself through lakeFS, but rather talk directly to the underlying object store.

The change will be made gradually over (at least) 3 releases:

lakeFS On The Rocks: Milestone #1 - Committed Metadata Extraction

The initial release of the new lakeFS design will include the following changes:

  1. Commit metadata stored on S3 in SSTable format
  2. Uncommitted entries will be stored in PostgreSQL
  3. Refs (branches, tags) will be stored in PostgreSQL

This release doesn’t change the way lakeFS is deployed or operates - but represents a major change in data model (moving from legacy MVCC data model, to a Merkle tree structure, which brings lakeFS much closer to Git).

While there’s still a strong dependency on PostgreSQL, the schema and access patterns are much simpler, resulting in improved performance and reduced operational ovearhead.

lakeFS On The Rocks: Milestone #2 - Metadata Service

Extracting metadata into its own service:

  1. gRPC service that exposes metadata operations
  2. S3 gateway API and OpenAPI servers become stateless and act as clients for the metadata service
  3. Native Hadoop Filesystem implementation on top of the metadata service, with Apache Spark support (depends on Java native SDK for metadata server below)

lakeFS On The Rocks: Milestone #3 - Remove PostgreSQL

Removing PostgreSQL for uncommitted data and refs, moving to Raft:

  1. Turn the metadata server into a Raft consensus group, managing state in RocksDB
  2. Remove dependency on PostgreSQL
  3. Raft snapshots stored natively on underlying object stores

This release will mark the completion of project “lakeFS on the Rocks”


Kubernetes operator for gateways and Metadata servers

We see Kubernetes as a first class deployment target for lakeFS. While deploying the stateless components such as the S3 Gateway and the OpenAPI server is relatively easy, deploying the metadata service which is a stateful Raft group is a little more involved. Design is still pending while we learn more about the best practices (for example, the Consul Kubernetes operator).

Azure Data Lake Storage Support

Allow lakeFS to run natively on Azure, with full support for storing both metadata and data on ADLS

Metadata operations security and access model

Reduce the operational overhead of managing access control: Currently operators working with both lakeFS and the native object store are required to manage a similar set of access controls for both. Moving to a federated access control model using the object store’s native access control facilities (e.g. IAM) will help reduce this overhead. This requires more discovery around the different use cases to help design something coherent. If you’re using lakeFS and have strong opinions about access control, please reach out on Slack.


Java native SDK for metadata server

Will be the basis of a Hadoop Filesystem implementation. It will allow compute engines to access the data directly without proxying it through lakeFS, improving operational efficiency and scalability.

Python SDK for metadata server

In order to support automated data CI/CD integration with pipeline management and orchestration tools such as Apache Airflow and Luigi (see Continuous Deployment below), a Python SDK for lakeFS metadata API is required. It will be used by the libraries native to each orchestration tool.

Use Case: Development Environment

Ephemeral branches with a TTL

Throwaway development or experimentation branches that live for a pre-configured amount of time, and are cleaned up afterwards. This is especially useful when running automated tests or when experimenting with new technologies, code or algorithms. We want to see what the outcome looks like, but don’t really need the output to live much longer than the duration of the experiment.

Use Case: Continuous Integration

Repo linking

The ability to explicitly depend on data residing in another repository. While it is possible to state these cross links by sticking them in the report’s commit metadata, we think a more explicit and structured approach would be valuable. Stating our dependencies in something that resembles a pom.xml or go.mod file would allow us to support better CI and CD integrations that ensure reproducibility without vendoring or copying data.

Webhook Support

Being able to have pre-defined code execute before a commit or merge operation - potentially preventing that action from taking place. This allows lakeFS users to codify best practices (format and schema enforcement before merging to master) as well as run tools such as Great Expectations or Monte Carlo before data ever reaches consumers.

Protected Branches

A way to ensure certain branches (i.e. main or master) are only merged to, and are not being directly written to. In combination with Webhook Support (see above), this allows users to provide a set of quality guarantees about a given branch (i.e., reading from master ensures schema never breaks and all partitions are complete and tested)

Webhook Support integration: Metastore registration

Using webhooks, we can automatically register or update collections in a Hive/Glue metastore, using Symlink Generation, this will also allow systems that don’t natively integrate with lakeFS to consume data produced using lakeFS.

Webhook Support integration: Metadata validation

Provide a basic wrapper around something like pyArrow that validates Parquet or ORC files for common schema problems such as backwards incompatibility.

Use Case: Continuous Deployment

Airflow Operators

Provide a set of reusable building blocks for Airflow that can create branches, commit and merge. The idea here is to enhance existing pipelines that, for example, run a series of Spark jobs, with an easy way to create a lakeFS branch before starting, passing that branch as a parameter to all Spark jobs, and upon successful execution, commit and merge their output to master.

Webhook Support integration: Data Quality testing

Provide a webhook around a tool such as Great Expectations that runs data quality tests before merging into a main/master branch.

Webhook Alerting

Support integration into existing alerting systems that trigger in the event a webhook returns a failure. This is useful for example when a data quality test fails, so new data is not merged into main/master due to a quality issue, so will alert the owning team.