MATLAB Integration¶
Integrate lakeFS with MATLAB to version-control data in data science, data engineering, and machine learning workflows.
Overview¶
The lakeFS MATLAB integration provides native MATLAB functions to interact with lakeFS repositories, enabling:
- Version control for data: Track datasets, models, and results across experiments
- Reproducible research: Link code to specific data versions with commit IDs
- Isolated experimentation: Create branches for parallel experiments without data duplication
- Direct data access: Mount lakeFS repositories as local directories using Everest
This integration uses two MATLAB helper classes:
- lakefs.m: Core lakeFS operations (branches, commits, tags, metadata)
- everest.m: File system mounting for direct data access
Prerequisites¶
Before using lakeFS with MATLAB, ensure you have:
- lakeFS Server: Running instance (local or cloud)
  - Quick Start Guide
- lakectl CLI: Installed and configured
  - lakectl Installation
  - Configuration file at ~/.lakectl.yaml
- Everest (optional, for mounting): File system interface
  - Everest Documentation
  - Install: brew install treeverse/brew/everest (macOS)
- MATLAB: Version R2023a or later recommended
Installation¶
Download Helper Files¶
Download the MATLAB helper classes:
- lakefs.m - Core lakeFS operations
- everest.m - File system mounting
Place both files in your MATLAB project directory, or add the folder that contains them to your MATLAB path with addpath.
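As a minimal sketch of the path setup, assuming the two helper files were saved in a hypothetical `lakefs_helpers` folder under the current directory (the folder name is illustrative and not part of the integration):

```matlab
% Hypothetical folder that holds lakefs.m and everest.m
helperDir = fullfile(pwd, 'lakefs_helpers');

if isfolder(helperDir)
    addpath(helperDir);   % Make the helper classes visible for this session
    fprintf('Added %s to the MATLAB path\n', helperDir);
else
    warning('Helper folder not found: %s', helperDir);
end

% An empty result means MATLAB cannot see the class yet
disp(which('lakefs'));
```

`addpath` only lasts for the current session; `savepath` can make the change persistent.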
Configure lakeFS Connection¶
Create your lakeFS configuration by copying and editing the template:
lakefs_template.m:
classdef lakefs_template
    % LAKEFS_TEMPLATE - Configuration template for lakeFS connection
    %
    % SETUP:
    %   1. Copy this file to lakefs.m in your project directory and
    %      rename the class to lakefs (MATLAB requires the class name
    %      to match the file name)
    %   2. Fill in your credentials below
    %   3. Keep lakefs.m out of version control (.gitignore)
    properties(Constant)
        % === LOCAL CONFIGURATION ===
        LOCAL_ENDPOINT = 'http://localhost:8000'
        LOCAL_ACCESS_KEY = 'YOUR_ACCESS_KEY_HERE'
        LOCAL_SECRET_KEY = 'YOUR_SECRET_KEY_HERE'
        % === CLOUD CONFIGURATION ===
        CLOUD_ENDPOINT = 'https://your-org.us-east-1.lakefscloud.io'
        CLOUD_ACCESS_KEY = 'YOUR_CLOUD_ACCESS_KEY_HERE'
        CLOUD_SECRET_KEY = 'YOUR_CLOUD_SECRET_KEY_HERE'
        % === CURRENT CONFIGURATION (will be set by configure()) ===
        % These will be populated when you call lakefs.configure()
    end
    properties(Access=private)
        current_endpoint
        current_access_key
        current_secret_key
    end
    methods(Static)
        function configure(environment)
            % CONFIGURE - Set active lakeFS environment
            %
            % Syntax:
            %   lakefs.configure('local') % Use local lakeFS
            %   lakefs.configure('cloud') % Use cloud lakeFS
        end
    end
end
Configuration steps:
- Copy lakefs_template.m to lakefs.m
- Edit credentials in lakefs.m
- Add lakefs.m to .gitignore to keep credentials private
- Set active environment in your scripts:
% Use local lakeFS instance
lakefs.configure('local');
% Or use cloud lakeFS instance
lakefs.configure('cloud');
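The template leaves the body of configure empty. One way a selected environment could be applied, assuming lakefs.m drives lakectl as a child process (not confirmed here), is through the environment variables that lakectl documents as an alternative to ~/.lakectl.yaml. The values below are the template placeholders, not real credentials:

```matlab
% Illustrative values; in practice these would come from the constants in lakefs.m
endpoint  = 'http://localhost:8000';
accessKey = 'YOUR_ACCESS_KEY_HERE';
secretKey = 'YOUR_SECRET_KEY_HERE';

% Environment variables read by lakectl
setenv('LAKECTL_SERVER_ENDPOINT_URL', endpoint);
setenv('LAKECTL_CREDENTIALS_ACCESS_KEY_ID', accessKey);
setenv('LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY', secretKey);

% Processes started with system() or ! inherit these variables
fprintf('lakectl endpoint set to %s\n', getenv('LAKECTL_SERVER_ENDPOINT_URL'));
```

Keeping credentials in the process environment rather than in a file means they last only for the MATLAB session.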
Basic Usage¶
Repository Operations¶
List branches:
% List all branches in a repository
branches = lakefs.list_branches('my-repo');
disp(branches.branch);
List tags:
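Tags can be listed through the lakectl CLI from inside MATLAB. This sketch assumes lakectl is installed and configured as described in Prerequisites. It does not use lakefs.m, and the repository name is illustrative:

```matlab
repo = 'my-repo';

% Build and run: lakectl tag list lakefs://my-repo
cmd = sprintf('lakectl tag list lakefs://%s', repo);
[status, output] = system(cmd);

if status == 0
    disp(output);   % Raw CLI listing
else
    warning('lakectl failed (exit code %d): %s', status, output);
end
```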
View commit history:
% Get commit log for a branch
commits = lakefs.log('my-repo', 'main', 'amount', 5);
for i = 1:height(commits)
    fprintf('%s: %s\n', commits.id{i}(1:12), commits.message{i});
end
Branch Management¶
Create a branch:
% Create a new branch from main
lakefs.create_branch('my-repo', 'experiment-1', 'main');
Compare branches:
% Show differences between branches
changes = lakefs.diff('my-repo', 'main', 'experiment-1'); % Named to avoid shadowing MATLAB's built-in diff
fprintf('Changed objects: %d\n', height(changes));
Data Operations¶
Upload data:
% Upload local file to lakeFS branch
lakefs.upload('my-repo', 'experiment-1', ...
'data/results.mat', ... % Destination in lakeFS
'local/output/results.mat'); % Local source file
Download data:
% Download file from lakeFS
lakefs.download('my-repo', 'main', ...
'models/trained_model.mat', ... % Source in lakeFS
'local/models/model.mat'); % Local destination
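Whether lakefs.download creates a missing destination folder is not stated here. As a precaution, this sketch creates the folder with standard MATLAB calls before the download runs. It does not depend on lakefs.m:

```matlab
% Local destination used in the download example
local_path = fullfile('local', 'models', 'model.mat');

% Create the parent folder if it does not exist yet
dest_dir = fileparts(local_path);
if ~isfolder(dest_dir)
    mkdir(dest_dir);
end

fprintf('Destination folder ready: %s\n', dest_dir);
```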
Commit changes:
% Commit with metadata
metadata = struct();
metadata.accuracy = '0.945';
metadata.training_time = '120.5';
metadata.dataset = 'v1.2';
lakefs.commit('my-repo', 'experiment-1', ...
'Trained model with improved accuracy', ...
'metadata', metadata);
Mounting (Everest)¶
For direct file system access, use Everest to mount lakeFS paths:
% Mount a branch
everest.mount('lakefs://my-repo/main/', 'local_data/');
% Read files directly
data = load('local_data/results.mat');
% Unmount when done
everest.umount('local_data/');
Note: Mounting enables standard MATLAB file I/O functions to work with lakeFS data without explicit upload/download operations.
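If the code between mount and umount throws an error, the unmount call is never reached. A minimal sketch of guarding against that with MATLAB's onCleanup follows. The function name with_mount is hypothetical and not part of everest.m, and it would live in its own file, with_mount.m:

```matlab
function result = with_mount(mountFcn, umountFcn, workFcn)
% WITH_MOUNT  Run workFcn between a mount and a guaranteed unmount.
%   Hypothetical helper, not part of everest.m.
%
%   Example (assumes everest.m is on the path):
%     data = with_mount( ...
%         @() everest.mount('lakefs://my-repo/main/', 'local_data/'), ...
%         @() everest.umount('local_data/'), ...
%         @() load('local_data/results.mat'));

    mountFcn();
    % onCleanup calls umountFcn when this variable is destroyed, which
    % happens on normal return and also when workFcn throws an error
    cleanup = onCleanup(umountFcn); %#ok<NASGU>
    result = workFcn();
end
```

Passing the mount and unmount steps as function handles keeps the helper independent of how the mount is actually performed.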
Core API Reference¶
lakefs.m - Repository Operations¶
Configuration¶
lakefs.configure(environment)
Description: Set active lakeFS environment
Parameters:
- environment (string): 'local' or 'cloud'
Branch Operations¶
lakefs.list_branches(repo)
Description: List all branches in repository
Parameters:
- repo (string): Repository name
Returns: Table with branch names and commit IDs
lakefs.create_branch(repo, branch_name, source_ref)
Description: Create new branch from source reference
Parameters:
- repo (string): Repository name
- branch_name (string): New branch name
- source_ref (string): Source branch or commit ID
Commit Operations¶
lakefs.log(repo, ref, 'amount', N)
Description: Get commit history
Parameters:
- repo (string): Repository name
- ref (string): Branch or commit reference
- Optional: 'amount', N - Limit number of commits returned
Returns: Table with commit IDs, messages, timestamps, metadata
lakefs.commit(repo, branch, message, 'metadata', metadata)
Description: Commit changes to branch
Parameters:
- repo (string): Repository name
- branch (string): Branch name
- message (string): Commit message
- Optional: 'metadata', struct - Attach metadata to commit
Data Transfer¶
lakefs.upload(repo, branch, lakefs_path, local_path)
Description: Upload local file to lakeFS
Parameters:
- repo (string): Repository name
- branch (string): Branch name
- lakefs_path (string): Destination path in lakeFS
- local_path (string): Source file on local system
lakefs.download(repo, ref, lakefs_path, local_path)
Description: Download file from lakeFS
Parameters:
- repo (string): Repository name
- ref (string): Branch, tag, or commit ID
- lakefs_path (string): Source path in lakeFS
- local_path (string): Destination on local system
Comparison¶
lakefs.diff(repo, left_ref, right_ref)
Description: Compare two references
Parameters:
- repo (string): Repository name
- left_ref (string): First reference (branch/tag/commit)
- right_ref (string): Second reference
Returns: Table showing changed, added, removed objects
Tags¶
Description: List all tags
Parameters:
- repo (string): Repository name
Returns: Table with tag names and commit IDs
everest.m - File System Operations¶
Mounting¶
everest.mount(lakefs_uri, mount_dir)
Description: Mount lakeFS path as local directory
Parameters:
- lakefs_uri (string): Full lakeFS URI (e.g., 'lakefs://repo/branch/path/')
- mount_dir (string): Local mount point
- Optional: 'presign', true - Use presigned URLs (default: true)
everest.umount(mount_dir)
Description: Unmount directory
Parameters:
- mount_dir (string): Local mount point to unmount
Description: Show active mounts
Returns: Cell array of currently mounted directories
Description: Check Everest installation and accessibility
Example Workflows¶
Experiment Tracking¶
% Configure environment
lakefs.configure('local');
% Create experiment branch
lakefs.create_branch('ml-project', 'experiment-3', 'main');
% Train model (your code here)
model = train_model(data);
save('outputs/model_v3.mat', 'model');
% Upload results
lakefs.upload('ml-project', 'experiment-3', ...
'models/model_v3.mat', 'outputs/model_v3.mat');
% Commit with metrics
metadata = struct();
metadata.accuracy = num2str(model.accuracy);
metadata.loss = num2str(model.loss);
metadata.epochs = num2str(model.epochs);
lakefs.commit('ml-project', 'experiment-3', ...
'Experiment 3: Increased learning rate', ...
'metadata', metadata);
% View experiment history
commits = lakefs.log('ml-project', 'experiment-3', 'amount', 1);
fprintf('Experiment accuracy: %s\n', commits.metadata{1}.accuracy);
Data Version Comparison¶
% Compare two data versions
changes = lakefs.diff('research-data', 'v1.0', 'v2.0'); % Named to avoid shadowing MATLAB's built-in diff
% Show what changed
fprintf('Data changes between versions:\n');
for i = 1:height(changes)
    fprintf(' %s: %s\n', changes.type{i}, changes.path{i});
end
% Download specific version
lakefs.download('research-data', 'v1.0', ...
'datasets/training_data.mat', 'data/v1_training.mat');
lakefs.download('research-data', 'v2.0', ...
'datasets/training_data.mat', 'data/v2_training.mat');
% Compare in MATLAB
v1 = load('data/v1_training.mat');
v2 = load('data/v2_training.mat');
fprintf('V1 samples: %d\n', size(v1.data, 1));
fprintf('V2 samples: %d\n', size(v2.data, 1));
Reproducible Analysis¶
% Record exact data version used
commits = lakefs.log('sensor-data', 'main', 'amount', 1);
data_commit = commits.id{1};
% Mount data
everest.mount('lakefs://sensor-data/main/', 'analysis_data/');
% Run analysis
data = load('analysis_data/sensor_readings.mat');
results = analyze_sensors(data);
% Unmount
everest.umount('analysis_data/');
% Save results locally so there is a file to upload
save('results/output.mat', 'results');
% Store results with data lineage
lakefs.upload('sensor-data', 'analysis-results', ...
    'results/analysis_output.mat', 'results/output.mat');
metadata = struct();
metadata.data_commit = data_commit; % Track source data version
metadata.analysis_date = char(datetime('now')); % datetime replaces the older datestr(now)
metadata.mean_value = num2str(mean(results.values));
lakefs.commit('sensor-data', 'analysis-results', ...
    'Sensor analysis results', 'metadata', metadata);
% Later: Reproduce exact analysis
fprintf('Analysis used data from commit: %s\n', data_commit);
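The example above mounts main, which may have advanced by the time the analysis is repeated. A sketch of reading from the recorded commit instead follows. It assumes Everest accepts a commit ID in place of a branch name in the lakeFS URI, as lakefs.download does for its ref parameter. The commit value shown is a placeholder:

```matlab
repo = 'sensor-data';
data_commit = '<commit-id-from-lakefs.log>';   % Placeholder, e.g. commits.id{1}

% Put the commit ID where the branch name would otherwise go
pinned_uri = sprintf('lakefs://%s/%s/', repo, data_commit);
fprintf('Pinned URI: %s\n', pinned_uri);

% With everest.m on the path, the pinned snapshot could then be mounted:
%   everest.mount(pinned_uri, 'analysis_data/');
```

A commit ID always refers to the same data, while a branch name refers to whatever was committed last.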
Advanced Topics¶
Environment Switching¶
Switch between local and cloud environments in the same session:
% Work with local instance
lakefs.configure('local');
local_branches = lakefs.list_branches('dev-repo');
% Switch to cloud
lakefs.configure('cloud');
cloud_branches = lakefs.list_branches('prod-repo');
Metadata Best Practices¶
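Group metadata into consistent categories (experiment details, results, environment, data lineage) so that commits stay comparable across runs: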
metadata = struct();
% Experiment details
metadata.experiment_id = 'exp-2024-001';
metadata.hypothesis = 'increased_batch_size';
% Quantitative results
metadata.accuracy = sprintf('%.4f', results.accuracy);
metadata.training_time_sec = num2str(results.time);
% Computational environment
metadata.matlab_version = version;
metadata.gpu_used = 'NVIDIA RTX 4090';
% Data lineage
metadata.training_data_commit = training_commit_id;
metadata.validation_data_commit = val_commit_id;
lakefs.commit(repo, branch, message, 'metadata', metadata);
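Every metadata value in these examples is passed as text. Assuming lakefs.commit expects text values, a struct with mixed types can be converted in one step. The field names and values below are illustrative:

```matlab
% Raw results with mixed types (illustrative values)
raw = struct('accuracy', 0.945, 'epochs', 20, 'dataset', 'v1.2', 'gpu', true);

% Convert every field to a character vector
toText = @(v) char(string(v));
metadata = structfun(toText, raw, 'UniformOutput', false);

disp(metadata);
```

string() applies a short default number format, so sprintf is the better choice for fields that need a fixed number of digits.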
Working with Large Datasets¶
For large datasets, prefer mounting over upload/download:
% Mount for direct access (no copying)
everest.mount('lakefs://large-data/main/datasets/', 'data/');
% Process data in chunks
imds = imageDatastore('data/images', ... % Named imds to avoid shadowing MATLAB's datastore function
    IncludeSubfolders=true, ...
    LabelSource="foldernames");
% Process without loading all into memory
while hasdata(imds)
    img = read(imds);
    process_image(img);
end
% Unmount when done
everest.umount('data/');
Troubleshooting¶
Connection Issues¶
Error: Cannot connect to lakeFS
Verify configuration:
% Check lakectl is configured
!lakectl repo list
% Test lakefs.m configuration
lakefs.configure('local'); % or 'cloud'
Mount Issues¶
Error: Mount failed
Verify Everest installation:
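A basic check is whether the everest executable is on the PATH that MATLAB sees. This sketch uses only operating system lookup commands and does not call everest.m:

```matlab
tool = 'everest';

% 'where' on Windows and 'which' elsewhere print the executable location
if ispc
    cmd = sprintf('where %s', tool);
else
    cmd = sprintf('which %s', tool);
end
[status, location] = system(cmd);

if status == 0
    fprintf('%s found at %s\n', tool, strtrim(location));
else
    fprintf('%s is not on the PATH seen by MATLAB:\n%s\n', tool, getenv('PATH'));
end
```

If the tool is installed but not found, MATLAB's PATH may differ from your shell's. It can be extended for the current session with setenv('PATH', ...).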
Mount not showing files
Wait for mount to complete:
everest.mount('lakefs://repo/branch/', 'data/');
pause(2); % Give mount time to initialize
dir('data/') % Should now show files
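A fixed pause can be too short or longer than needed. As an alternative sketch, the folder can be polled until it lists at least one entry or a timeout passes. The function wait_for_files is hypothetical, not part of everest.m, and would live in its own file:

```matlab
function ready = wait_for_files(folder, timeoutSec)
% WAIT_FOR_FILES  Poll a folder until it lists at least one entry.
%   Hypothetical helper, not part of everest.m. Returns true as soon as
%   the folder shows an entry other than '.' and '..', or false once
%   timeoutSec seconds have passed.
    ready = false;
    timer = tic;
    while true
        listing = dir(folder);
        if ~isempty(listing) && any(~ismember({listing.name}, {'.', '..'}))
            ready = true;
            return;
        end
        if toc(timer) >= timeoutSec
            return;
        end
        pause(0.25);   % Short delay between checks
    end
end
```

After everest.mount, a call such as wait_for_files('data/', 10) would return once files appear or after ten seconds. An empty branch never lists files, so a false result does not always mean the mount failed.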
Path Issues¶
Error: File not found in lakeFS
Verify path format:
% Correct: Relative path from branch root
lakefs.upload('repo', 'branch', 'data/file.mat', 'local.mat');
% Incorrect: Leading slash
lakefs.upload('repo', 'branch', '/data/file.mat', 'local.mat');
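Paths built with fullfile on Windows contain backslashes, and paths joined by hand can pick up a leading or doubled slash. A small sketch of cleaning a path before passing it as lakefs_path, using only standard MATLAB string functions and an illustrative input:

```matlab
raw_path = '/data\results//file.mat';   % Illustrative input

lakefs_path = strrep(raw_path, '\', '/');             % Backslashes to forward slashes
lakefs_path = regexprep(lakefs_path, '^/+', '');      % Drop leading slashes
lakefs_path = regexprep(lakefs_path, '/{2,}', '/');   % Collapse repeated slashes

disp(lakefs_path);   % prints data/results/file.mat
```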
Additional Resources¶
- lakeFS Documentation
- lakectl CLI Reference
- Everest Mount Reference
- lakeFS Samples Repository
- lakeFS Community Slack
Version Compatibility¶
| Component | Minimum Version | Recommended |
|---|---|---|
| MATLAB | R2020a | R2023a+ |
| lakeFS | 0.100.0 | Latest |
| lakectl | 0.100.0 | Latest |
| Everest | 0.1.0 | Latest |
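Note: MATLAB R2021a introduced Name=Value syntax for function arguments. Earlier versions use 'Name', value pairs, so examples on this page written with Name=Value (such as the imageDatastore call) need R2021a or later. The helper files are compatible with both syntaxes.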
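The two argument styles are interchangeable where both are supported. A short illustration with a built-in function, using illustrative branch names:

```matlab
names = {'main', 'experiment-1'};

% Classic 'Name', value pair: works in older and newer releases
a = cellfun(@upper, names, 'UniformOutput', false);

% Name=Value form: newer releases only
b = cellfun(@upper, names, UniformOutput=false);

% Both calls produce the same cell array
disp(isequal(a, b));
```

Scripts that must run on the minimum supported release should use the classic pair form.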