Use Python to interact with your objects on lakeFS
There are two primary ways to work with lakeFS from Python:
- Use Boto to perform object operations through the lakeFS S3 gateway.
- Use the lakeFS SDK to perform versioning and other lakeFS-specific operations.
Using the lakeFS SDK
Installing
Install the Python client using pip:
pip install 'lakefs_client==<lakeFS version>'
Initializing
Here’s how to instantiate a client:
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient
# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = 'AKIAIOSFODNN7EXAMPLE'
configuration.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
configuration.host = 'http://localhost:8000'
client = LakeFSClient(configuration)
For testing SSL endpoints, you may wish to use a self-signed certificate. If you do this and receive an SSL: CERTIFICATE_VERIFY_FAILED error message, you can add the following setting to your client configuration:
configuration.verify_ssl = False
Warning: this setting exposes you to well-known man-in-the-middle, impersonation, and credential-stealing attacks. Never use it in any production setting.
Optionally, to communicate through a proxy, set the proxy configuration (and, if needed, a custom CA bundle):
configuration.ssl_ca_cert = <path to a file of concatenated CA certificates in PEM format>  # customize the certificate bundle used to verify the peer
configuration.proxy = <proxy server URL>
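In scripts, you may prefer to read credentials from the environment rather than hard-coding them. The variable names below (LAKEFS_ENDPOINT, LAKEFS_ACCESS_KEY_ID, LAKEFS_SECRET_ACCESS_KEY) are illustrative, not an official convention — a minimal sketch:

```python
import os

def config_from_env(env=os.environ):
    """Collect lakeFS connection settings from environment variables.

    Returns a dict whose keys mirror the lakefs_client.Configuration
    attributes used above (host, username, password).
    """
    return {
        'host': env.get('LAKEFS_ENDPOINT', 'http://localhost:8000'),
        'username': env['LAKEFS_ACCESS_KEY_ID'],
        'password': env['LAKEFS_SECRET_ACCESS_KEY'],
    }

# Apply the settings to a Configuration object, e.g.:
# configuration = lakefs_client.Configuration()
# for key, value in config_from_env().items():
#     setattr(configuration, key, value)
```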
Usage Examples
Now that you have a client object, you can use it to interact with the API.
Creating a repository
repo = models.RepositoryCreation(name='example-repo', storage_namespace='s3://storage-bucket/repos/example-repo', default_branch='main')
client.repositories_api.create_repository(repo)
# output:
# {'creation_date': 1617532175,
# 'default_branch': 'main',
# 'id': 'example-repo',
# 'storage_namespace': 's3://storage-bucket/repos/example-repo'}
Creating a branch, uploading files, committing changes
List the repository branches:
client.branches_api.list_branches('example-repo').results
# output:
# [{'commit_id': 'cdd673a4c5f42d33acdf3505ecce08e4d839775485990d231507f586ebe97656', 'id': 'main'}]
Create a new branch:
client.branches_api.create_branch(repository='example-repo', branch_creation=models.BranchCreation(name='experiment-aggregations1', source='main'))
# output:
# 'cdd673a4c5f42d33acdf3505ecce08e4d839775485990d231507f586ebe97656'
List again to see your newly created branch:
client.branches_api.list_branches('example-repo').results
# output:
# [{'commit_id': 'cdd673a4c5f42d33acdf3505ecce08e4d839775485990d231507f586ebe97656', 'id': 'experiment-aggregations1'}, {'commit_id': 'cdd673a4c5f42d33acdf3505ecce08e4d839775485990d231507f586ebe97656', 'id': 'main'}]
Great. Now, let’s upload a file into your new branch:
with open('file.csv', 'rb') as f:
    client.objects_api.upload_object(repository='example-repo', branch='experiment-aggregations1', path='path/to/file.csv', content=f)
# output:
# {'checksum': '0d3b39380e2500a0f60fb3c09796fdba',
# 'mtime': 1617534834,
# 'path': 'path/to/file.csv',
# 'path_type': 'object',
# 'physical_address': 'local://example-repo/1865650a296c42e28183ad08e9b068a3',
# 'size_bytes': 18}
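To upload a whole directory, you can walk it and derive each object's path from the file's location. The helpers below are illustrative (not part of the SDK); they map local files under local_dir to lakeFS paths under prefix and hand each one to an upload callable with the same keyword interface as client.objects_api.upload_object:

```python
from pathlib import Path

def tree_paths(local_dir, prefix=''):
    """Yield (local_path, lakefs_path) pairs for every file under local_dir."""
    root = Path(local_dir)
    for p in sorted(root.rglob('*')):
        if p.is_file():
            rel = p.relative_to(root).as_posix()
            yield p, (f"{prefix.rstrip('/')}/{rel}" if prefix else rel)

def upload_tree(upload, repository, branch, local_dir, prefix=''):
    """Upload every file under local_dir. `upload` is a callable with the
    same keyword interface as client.objects_api.upload_object."""
    for local_path, lakefs_path in tree_paths(local_dir, prefix):
        with open(local_path, 'rb') as f:
            upload(repository=repository, branch=branch, path=lakefs_path, content=f)
```

For example, `upload_tree(client.objects_api.upload_object, 'example-repo', 'experiment-aggregations1', './data', 'path/to')` would upload every file under `./data`.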
Diffing a single branch will show all the uncommitted changes on that branch:
client.branches_api.diff_branch(repository='example-repo', branch='experiment-aggregations1').results
# output:
# [{'path': 'path/to/file.csv', 'path_type': 'object', 'type': 'added'}]
As expected, our change appears here. Let’s commit it and attach some arbitrary metadata:
client.commits_api.commit(
    repository='example-repo',
    branch='experiment-aggregations1',
    commit_creation=models.CommitCreation(message='Added a CSV file!', metadata={'using': 'python_api'}))
# output:
# {'committer': 'barak',
# 'creation_date': 1617535120,
# 'id': 'e80899a5709509c2daf797c69a6118be14733099f5928c14d6b65c9ac2ac841b',
# 'message': 'Added a CSV file!',
# 'meta_range_id': '',
# 'metadata': {'using': 'python_api'},
# 'parents': ['cdd673a4c5f42d33acdf3505ecce08e4d839775485990d231507f586ebe97656']}
Diffing again, this time there should be no uncommitted files:
client.branches_api.diff_branch(repository='example-repo', branch='experiment-aggregations1').results
# output:
# []
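In automation, a common pattern is to commit only when the diff actually shows uncommitted changes. The helpers below are a sketch: they operate on the `.results` list returned by diff_branch, using the entry shape shown above:

```python
def has_uncommitted_changes(diff_results):
    """Return True if the diff results list contains any entries.

    `diff_results` is the `.results` list from
    client.branches_api.diff_branch(...), e.g.
    [{'path': ..., 'path_type': 'object', 'type': 'added'}].
    """
    return len(diff_results) > 0

def summarize_diff(diff_results):
    """Count diff entries by change type ('added', 'changed', 'removed')."""
    counts = {}
    for entry in diff_results:
        counts[entry['type']] = counts.get(entry['type'], 0) + 1
    return counts
```

You could then guard the commit with `if has_uncommitted_changes(diff.results): client.commits_api.commit(...)`.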
Merging changes from a branch into main
Let’s diff between your branch and the main branch:
client.refs_api.diff_refs(repository='example-repo', left_ref='main', right_ref='experiment-aggregations1').results
# output:
# [{'path': 'path/to/file.csv', 'path_type': 'object', 'type': 'added'}]
Looks like you have a change. Let’s merge it:
client.refs_api.merge_into_branch(repository='example-repo', source_ref='experiment-aggregations1', destination_branch='main')
# output:
# {'reference': 'd0414a3311a8c1cef1ef355d6aca40db72abe545e216648fe853e25db788fa2e',
# 'summary': {'added': 1, 'changed': 0, 'conflict': 0, 'removed': 0}}
Let’s diff again. This time there should be no differences, since all of the changes are already on main:
client.refs_api.diff_refs(repository='example-repo', left_ref='main', right_ref='experiment-aggregations1').results
# output:
# []
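The merge response’s summary counts can also be checked programmatically, for example to fail a pipeline when a merge reports conflicts. A small sketch based on the summary shape shown above:

```python
def check_merge_summary(summary):
    """Raise if the merge summary reports conflicts; otherwise return the
    total number of merged changes. `summary` is the dict from the merge
    output, e.g. {'added': 1, 'changed': 0, 'conflict': 0, 'removed': 0}."""
    if summary.get('conflict', 0) > 0:
        raise RuntimeError(f"merge had {summary['conflict']} conflicting path(s)")
    return summary.get('added', 0) + summary.get('changed', 0) + summary.get('removed', 0)
```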
Python Client documentation
For the documentation of lakeFS’s Python package, see https://pydocs.lakefs.io
Full API reference
For a full reference of the lakeFS API, see the lakeFS API reference.
Using Boto
💡 To use Boto with lakeFS alongside S3, check out Boto S3 Router. It will route requests to either S3 or lakeFS according to the provided bucket name.
lakeFS exposes an S3-compatible API, so you can use Boto to interact with your objects on lakeFS.
Initializing
Create a Boto3 S3 client with your lakeFS endpoint and key-pair:
import boto3
s3 = boto3.client('s3',
    endpoint_url='https://lakefs.example.com',
    aws_access_key_id='AKIAIOSFODNN7EXAMPLE',
    aws_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY')
The client is now configured to operate on your lakeFS installation.
Usage Examples
Put an object into lakeFS
Use a branch name and a path to put an object in lakeFS:
with open('/local/path/to/file_0', 'rb') as f:
    s3.put_object(Body=f, Bucket='example-repo', Key='main/example-file.parquet')
You can now commit this change using the lakeFS UI or CLI.
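Through the S3 gateway, the bucket name is the repository and the first path component of the key is the ref (a branch, tag, or commit ID). The helpers below are illustrative, not part of any SDK; they just make that key convention explicit:

```python
def lakefs_key(ref, path):
    """Build an S3-gateway key from a lakeFS ref and an object path."""
    return f"{ref}/{path.lstrip('/')}"

def split_key(key):
    """Split an S3-gateway key back into (ref, path)."""
    ref, _, path = key.partition('/')
    return ref, path
```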
List objects
List the branch objects starting with a prefix:
list_resp = s3.list_objects_v2(Bucket='example-repo', Prefix='main/example-prefix')
for obj in list_resp['Contents']:
    print(obj['Key'])
Or, use a lakeFS commit ID to list objects for a specific commit:
list_resp = s3.list_objects_v2(Bucket='example-repo', Prefix='c7a632d74f/example-prefix')
for obj in list_resp['Contents']:
    print(obj['Key'])
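Note that list_objects_v2 returns at most 1,000 keys per call; for larger listings you need to follow continuation tokens (or use boto3’s built-in paginator). A generic sketch, written against any callable with the list_objects_v2 response shape so it also works with a stub:

```python
def iter_objects(list_objects_v2, bucket, prefix):
    """Yield every object dict for the prefix, following continuation tokens.

    `list_objects_v2` is a callable with boto3's signature, e.g.
    s3.list_objects_v2.
    """
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    while True:
        resp = list_objects_v2(**kwargs)
        for obj in resp.get('Contents', []):
            yield obj
        if not resp.get('IsTruncated'):
            break
        kwargs['ContinuationToken'] = resp['NextContinuationToken']
```

With a real client this would be `iter_objects(s3.list_objects_v2, 'example-repo', 'main/example-prefix')`.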
Get object metadata
Get object metadata using branch and path:
s3.head_object(Bucket='example-repo', Key='main/example-file.parquet')
# output:
# {'ResponseMetadata': {'RequestId': '72A9EBD1210E90FA',
# 'HostId': '',
# 'HTTPStatusCode': 200,
# 'HTTPHeaders': {'accept-ranges': 'bytes',
# 'content-length': '1024',
# 'etag': '"2398bc5880e535c61f7624ad6f138d62"',
# 'last-modified': 'Sun, 24 May 2020 10:42:24 GMT',
# 'x-amz-request-id': '72A9EBD1210E90FA',
# 'date': 'Sun, 24 May 2020 10:45:42 GMT'},
# 'RetryAttempts': 0},
# 'AcceptRanges': 'bytes',
# 'LastModified': datetime.datetime(2020, 5, 24, 10, 42, 24, tzinfo=tzutc()),
# 'ContentLength': 1024,
# 'ETag': '"2398bc5880e535c61f7624ad6f138d62"',
# 'Metadata': {}}
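For objects uploaded in a single request, the ETag shown above is an MD5 hex digest, so you can compare it against a locally computed digest to verify integrity. This is a sketch under that assumption; objects uploaded via multipart upload have composite ETags, for which this check does not apply:

```python
import hashlib

def md5_matches_etag(data, etag):
    """Compare local bytes against an S3-style ETag (an MD5 hex digest
    wrapped in double quotes, e.g. '"2398bc..."')."""
    return hashlib.md5(data).hexdigest() == etag.strip('"')
```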