
Versioning HuggingFace Datasets with lakeFS

HuggingFace 🤗 Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

🤗 Datasets supports access to cloud storage providers through fsspec FileSystem implementations.

lakefs-spec is a community implementation of an fsspec Filesystem that fully leverages lakeFS’ capabilities. Let’s start by installing it:

Installation

pip install lakefs-spec

Configuration

If you’ve already configured the lakeFS Python SDK and/or lakectl, you should have a $HOME/.lakectl.yaml file containing the access credentials and endpoint of your lakeFS environment.

Otherwise, install lakectl and run lakectl config to set up your access credentials.
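For orientation, here is a minimal sketch of the layout that file usually has. The helper render_lakectl_config is hypothetical (it is not part of lakectl or the SDK), the key names reflect the layout lakectl config is assumed to write, and all values are placeholders. It only builds the text and does not touch your real configuration file:

```python
def render_lakectl_config(access_key_id: str, secret_access_key: str, endpoint_url: str) -> str:
    """Return YAML text in the layout assumed for $HOME/.lakectl.yaml.

    Hypothetical helper for illustration only. Values are inserted as-is,
    so anything containing YAML special characters would need quoting.
    """
    return (
        "credentials:\n"
        f"  access_key_id: {access_key_id}\n"
        f"  secret_access_key: {secret_access_key}\n"
        "server:\n"
        f"  endpoint_url: {endpoint_url}\n"
    )


# Placeholder values, not real credentials or a real endpoint.
config_text = render_lakectl_config(
    "EXAMPLE_ACCESS_KEY_ID",
    "EXAMPLE_SECRET_ACCESS_KEY",
    "https://lakefs.example.com",
)
print(config_text)
```

Running lakectl config remains the recommended way to create the file; the sketch is only meant to show what to expect when you open it.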

Reading a Dataset

To read a dataset, all we have to do is use a lakefs://... URI when calling load_dataset:

>>> from datasets import load_dataset
>>> 
>>> dataset = load_dataset('csv', data_files='lakefs://example-repository/my-branch/data/example.csv')

That’s it! The lakefs:// scheme automatically selects the lakefs-spec implementation we installed, which reads its credentials from the $HOME/.lakectl.yaml file, so we don’t need to pass any additional configuration.
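The URI used above appears to follow the pattern lakefs://<repository>/<ref>/<path>, where the ref is a branch name here. The sketch below splits a URI into those three parts. The names parse_lakefs_uri and LakeFSLocation are hypothetical helpers for illustration, not part of lakefs-spec, and the code assumes the ref itself contains no slash:

```python
from typing import NamedTuple


class LakeFSLocation(NamedTuple):
    repository: str
    ref: str   # branch, tag, or commit ID
    path: str  # object path inside the repository


def parse_lakefs_uri(uri: str) -> LakeFSLocation:
    """Split 'lakefs://<repository>/<ref>/<path>' into its parts."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakefs:// URI: {uri!r}")
    # First segment is the repository, second is the ref, the rest is the path.
    repository, _, rest = uri[len(prefix):].partition("/")
    ref, _, path = rest.partition("/")
    if not repository or not ref:
        raise ValueError(f"URI must name a repository and a ref: {uri!r}")
    return LakeFSLocation(repository, ref, path)


location = parse_lakefs_uri("lakefs://example-repository/my-branch/data/example.csv")
print(location)
```

Keeping the three parts in mind makes it easier to see which branch a given load_dataset call actually reads from.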

Saving/Loading

Once we’ve loaded a Dataset, we can save it using the save_to_disk method as normal:

>>> dataset.save_to_disk('lakefs://example-repository/my-branch/datasets/example/')

At this point, we might want to commit that change to lakeFS and tag it, so we can share it with our colleagues.

We can do it through the UI or lakectl, but let’s do it with the lakeFS Python SDK:

>>> import lakefs
>>>
>>> repo = lakefs.repository('example-repository')
>>> commit = repo.branch('my-branch').commit(
...     'saved my first huggingface Dataset!',
...     metadata={'using': '🤗'})
>>> repo.tag('alice_experiment1').create(commit)

Now, others on our team can load our exact dataset by using the tag we created:

>>> from datasets import load_from_disk
>>>
>>> dataset = load_from_disk('lakefs://example-repository/alice_experiment1/datasets/example/')
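Note that the only difference between the save and load URIs is the ref segment: the branch name my-branch is replaced by the tag alice_experiment1. As a small sketch of that substitution, the hypothetical helper with_ref below (not part of lakefs-spec or the lakeFS SDK) swaps the ref in a lakefs:// URI, assuming the existing ref contains no slash:

```python
def with_ref(uri: str, new_ref: str) -> str:
    """Return the same lakefs:// URI pointing at a different ref."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakefs:// URI: {uri!r}")
    # Layout assumed: lakefs://<repository>/<ref>/<path>
    repository, _, rest = uri[len(prefix):].partition("/")
    old_ref, _, path = rest.partition("/")
    if not repository or not old_ref:
        raise ValueError(f"URI must name a repository and a ref: {uri!r}")
    return f"{prefix}{repository}/{new_ref}/{path}"


branch_uri = "lakefs://example-repository/my-branch/datasets/example/"
tagged_uri = with_ref(branch_uri, "alice_experiment1")
print(tagged_uri)
```

Because a tag points at a fixed commit, loading through the tagged URI keeps returning the same data even if my-branch receives further commits.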