- Use Case: Development Environment
- Use Case: Continuous Integration
- Use Case: Continuous Deployment
TL;DR - After receiving feedback on early versions of lakeFS, project “lakeFS on the Rocks” represents a set of changes to the architecture and data model of lakeFS. The main motivators are simplicity, reduced barriers of entry, scalability - and the added benefit of having lakeFS adhere more closely to Git in semantics and UX.
There are 3 big shifts in design:
- Using the underlying object store as a source of truth for all committed data. We do this by storing commits as RocksDB SSTable files, where each commit is a “snapshot” of a given repository, split across multiple SSTable files that could be reused across commits.
- Removal of PostgreSQL as a dependency: scaling it for very high throughput while keeping it predictable in performance for different loads and access patterns has a very high operational cost.
- Extract metadata operations into a separate service (with the S3 API Gateway remaining as a stateless client for this service). Would allow for the development of “native” clients for big data tools that don’t require passing the data itself through lakeFS, but rather talk directly to the underlying object store.
The change will be made gradually over (at least) 3 releases:
The initial release of the new lakeFS design will include the following changes:
- Commit metadata stored on S3 in SSTable format
- Uncommitted entries will be stored in PostgreSQL
- Refs (branches, tags) will be stored in PostgreSQL
This release doesn’t change the way lakeFS is deployed or operates - but represents a major change in data model (moving from legacy MVCC data model, to a Merkle tree structure, which brings lakeFS much closer to Git).
While there’s still a strong dependency on PostgreSQL, the schema and access patterns are much simpler, resulting in improved performance and reduced operational ovearhead.
Extracting metadata into its own service:
- gRPC service that exposes metadata operations
- S3 gateway API and OpenAPI servers become stateless and act as clients for the metadata service
- Native Hadoop Filesystem implementation on top of the metadata service, with Apache Spark support (depends on Java native SDK for metadata server below)
Removing PostgreSQL for uncommitted data and refs, moving to Raft:
- Turn the metadata server into a Raft consensus group, managing state in RocksDB
- Remove dependency on PostgreSQL
- Raft snapshots stored natively on underlying object stores
This release will mark the completion of project “lakeFS on the Rocks”
We see Kubernetes as a first class deployment target for lakeFS. While deploying the stateless components such as the S3 Gateway and the OpenAPI server is relatively easy, deploying the metadata service which is a stateful Raft group is a little more involved. Design is still pending while we learn more about the best practices (for example, the Consul Kubernetes operator).
Reduce the operational overhead of managing access control: Currently operators working with both lakeFS and the native object store are required to manage a similar set of access controls for both. Moving to a federated access control model using the object store’s native access control facilities (e.g. IAM) will help reduce this overhead. This requires more discovery around the different use cases to help design something coherent. If you’re using lakeFS and have strong opinions about access control, please reach out on Slack.
Will be the basis of a Hadoop Filesystem implementation. It will allow compute engines to access the data directly without proxying it through lakeFS, improving operational efficiency and scalability.
In order to support automated data CI/CD integration with pipeline management and orchestration tools such as Apache Airflow and Luigi (see Continuous Deployment below), a Python SDK for lakeFS metadata API is required. It will be used by the libraries native to each orchestration tool.
Throwaway development or experimentation branches that live for a pre-configured amount of time, and are cleaned up afterwards. This is especially useful when running automated tests or when experimenting with new technologies, code or algorithms. We want to see what the outcome looks like, but don’t really need the output to live much longer than the duration of the experiment.
The ability to explicitly depend on data residing in another repository. While it is possible to state these cross links by sticking them in the report’s commit metadata, we think a more explicit and structured approach would be valuable. Stating our dependencies in something that resembles a pom.xml or go.mod file would allow us to support better CI and CD integrations that ensure reproducibility without vendoring or copying data.
Being able to have pre-defined code execute before a commit or merge operation - potentially preventing that action from taking place. This allows lakeFS users to codify best practices (format and schema enforcement before merging to master) as well as run tools such as Great Expectations or Monte Carlo before data ever reaches consumers.
A way to ensure certain branches (i.e. main or master) are only merged to, and are not being directly written to. In combination with Webhook Support (see above), this allows users to provide a set of quality guarantees about a given branch (i.e., reading from master ensures schema never breaks and all partitions are complete and tested)
Using webhooks, we can automatically register or update collections in a Hive/Glue metastore, using Symlink Generation, this will also allow systems that don’t natively integrate with lakeFS to consume data produced using lakeFS.
Provide a basic wrapper around something like pyArrow that validates Parquet or ORC files for common schema problems such as backwards incompatibility.
Provide a set of reusable building blocks for Airflow that can create branches, commit and merge. The idea here is to enhance existing pipelines that, for example, run a series of Spark jobs, with an easy way to create a lakeFS branch before starting, passing that branch as a parameter to all Spark jobs, and upon successful execution, commit and merge their output to master.
Provide a webhook around a tool such as Great Expectations that runs data quality tests before merging into a main/master branch.
Support integration into existing alerting systems that trigger in the event a webhook returns a failure. This is useful for example when a data quality test fails, so new data is not merged into main/master due to a quality issue, so will alert the owning team.