Link Search Menu Expand Document

Using lakeFS with SageMaker

Amazon SageMaker helps prepare, build, train and deploy ML models quickly by bringing together a broad set of capabilities purpose-built for ML.

Table of contents

  1. Initializing session and client
  2. Usage Examples
    1. Upload train and test data
    2. Download objects

Initializing session and client

Initialize a Sagemaker session and an s3 client with lakeFS as the endpoint:

import sagemaker
import boto3

endpoint_url = '<LAKEFS_S3_GATEWAY_ENDPOINT>'
aws_access_key_id = '<LAKEFS_ACCESS_KEY_ID>'
aws_secret_access_key = '<LAKEFS_SECRET_ACCESS_KEY>'
repo = 'example-repo'

sm = boto3.client('sagemaker',

s3_resource = boto3.resource('s3',

session = sagemaker.Session(boto3.Session(), sagemaker_client=sm, default_bucket=repo)
session.s3_resource = s3_resource

Usage Examples

Upload train and test data

Let’s use the created session for uploading data to the ‘main’ branch:

prefix = "/prefix-within-branch"
branch = 'main'

train_file = 'train_data.csv';
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=branch + prefix + "/train")

test_file = 'test_data.csv';
test_data_no_target.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=branch + prefix + "/test")

Download objects

We can use the integration with lakeFS to download a portion of the data we see fit:

repo = 'example-repo'
prefix = "/prefix-to-download"
branch = 'main'
localpath = './' + branch

session.download_data(path=localpath, bucket=repo, key_prefix = branch + prefix)

Note: Advanced AWS SageMaker features, like Autopilot jobs, are encapsulated and don’t have the option to override the S3 endpoint. However, it is possible to export the required inputs from lakeFS to S3.
If you’re using SageMaker features that aren’t supported by lakeFS, we’d love to hear from you.