Using R with lakeFS
R is a powerful language used widely in data science. lakeFS interfaces with R in two ways:
- To read and write data in lakeFS, use standard S3 tools such as the aws.s3 library. lakeFS has an S3 gateway which presents a lakeFS repository as an S3 bucket.
- For lakeFS operations such as creating branches and commits, use the lakeFS API, which can be accessed from R using the httr library.
To see examples of R in action with lakeFS please visit the lakeFS-samples repository and the sample notebooks.
Reading and Writing from lakeFS with R
Working with data stored in lakeFS from R is the same as working with data in an S3 bucket, thanks to the S3 Gateway that lakeFS provides.
You can use any library that interfaces with S3. In this example we’ll use the aws.s3 library.
install.packages(c("aws.s3"))
library(aws.s3)
Configuration
The R S3 client documentation includes full details of the configuration options available. A good approach for using it with lakeFS is to set the endpoint and authentication details as environment variables:
Sys.setenv("AWS_ACCESS_KEY_ID" = "AKIAIOSFODNN7EXAMPLE",
"AWS_SECRET_ACCESS_KEY" = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"AWS_S3_ENDPOINT" = "lakefs.mycorp.com:8000")
Note: it is generally best practice to set these environment variables outside of the R script; it is done so here for convenience of the example.
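One way to follow this advice is to define the variables outside the script (for example in the ~/.Renviron file, which R reads at startup) and have the script only verify that they are present. The following is a minimal sketch under that assumption; missing_lakefs_vars is a hypothetical helper name and is not part of aws.s3.

```r
# Names of the environment variables used in the configuration above.
required_vars <- c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_S3_ENDPOINT")

# Hypothetical helper: return the names of any required variables that are
# unset or empty in the current R session.
missing_lakefs_vars <- function(vars = required_vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

# Report anything that still needs to be configured.
missing <- missing_lakefs_vars()
if (length(missing) > 0) {
  message("Missing lakeFS S3 settings: ", paste(missing, collapse = ", "))
}
```

Running this check at the top of a script gives an early, readable message instead of an authentication error from the first S3 call.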
In conjunction with this you must also specify region and use_https in each call of an aws.s3 function, as these cannot be set globally. For example:
bucketlist(
region = "",
use_https = FALSE
)
- region should always be empty.
- use_https should be set to TRUE or FALSE depending on whether your lakeFS endpoint uses HTTPS.
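Because these two arguments have to be repeated on every call, it can be convenient to wrap them once. The sketch below is one possible approach, not something aws.s3 provides: with_lakefs is a hypothetical wrapper that forwards its arguments to whichever function it is given and appends region and use_https.

```r
# Hypothetical wrapper: call any aws.s3 function with the two arguments
# that lakeFS requires on every call.
with_lakefs <- function(fun, ..., use_https = FALSE) {
  fun(..., region = "", use_https = use_https)
}

# Example calls (these need a reachable lakeFS server, so they are
# left commented out here):
# with_lakefs(aws.s3::bucketlist)
# with_lakefs(aws.s3::get_bucket_df, bucket = "quickstart", prefix = "main/")
```

Set use_https = TRUE in the call if your lakeFS endpoint is served over HTTPS.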
Listing repositories
The S3 gateway exposes a repository as a bucket, so the aws.s3 function bucketlist will return a list of available repositories on lakeFS:
bucketlist(
region = "",
use_https = FALSE
)
Writing to lakeFS from R
Assuming you’re using the aws.s3 library, there are various functions available, including s3save, s3saveRDS, and put_object. Here’s an example of writing an R object to lakeFS:
repo_name <- "example"
branch <- "development"
s3saveRDS(x=my_df,
bucket = repo_name,
object = paste0(branch,"/my_df.R"),
region = "",
use_https = FALSE)
You can also upload local files to lakeFS using R and the put_object function:
repo_name <- "example"
branch <- "development"
local_file <- "/tmp/never.gonna"
put_object(file = local_file,
bucket = repo_name,
object = paste0(branch,"/give/you/up"),
region = "",
use_https = FALSE)
Reading from lakeFS with R
As with writing data from R to lakeFS, there is a similar set of functions for reading data. These include s3load, s3readRDS, and get_object. Here’s an example of reading an R object from lakeFS:
repo_name <- "example"
branch <- "development"
my_df <- s3readRDS(bucket = repo_name,
object = paste0(branch,"/my_data.R"),
region = "",
use_https = FALSE)
Listing Objects
In general you should always specify a branch prefix when listing objects. Here’s an example to list the main branch in the quickstart repository:
get_bucket_df(bucket = "quickstart",
prefix = "main/",
region = "",
use_https = FALSE)
When listing objects in lakeFS, the repository (bucket) level is a special case: listing at this level returns the branches as folders. These are not listed recursively unless you list something under the branch. To understand more about this please refer to #5441.
Working with Arrow
Arrow’s R library includes powerful support for data analysis, including reading and writing file formats such as Parquet, Arrow, CSV, and JSON. It has functionality for connecting to S3, and so it works with lakeFS through the S3 gateway.
To start, install and load the library:
install.packages("arrow")
library(arrow)
Then create an S3FileSystem object to connect to your lakeFS instance:
lakefs <- S3FileSystem$create(
  endpoint_override = "lakefs.mycorp.com:8000",
  scheme = "http",
  access_key = "AKIAIOSFODNN7EXAMPLE",
  secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  region = ""
)
From here you can list the contents of a particular lakeFS repository and branch:
lakefs$ls(path = "quickstart/main")
To read a Parquet file from lakeFS with R, use the read_parquet function:
lakes <- read_parquet(lakefs$path("quickstart/main/lakes.parquet"))
Writing a file follows a similar pattern. Here the same data read above is written back to lakeFS in Arrow format:
write_feather(x = lakes,
sink = lakefs$path("quickstart/main/lakes.arrow"))
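The aws.s3 and Arrow examples address the same object in slightly different ways: aws.s3 takes the repository as the bucket and a key that starts with the branch, while Arrow takes a single path that starts with the repository. The sketch below makes that difference explicit; both function names are hypothetical helpers introduced only for illustration.

```r
# Hypothetical helpers: build the two forms of address used above.

# aws.s3: the repository is the bucket; the object key is "<branch>/<path>".
lakefs_object_key <- function(branch, path) {
  # Drop any leading slashes so the key never contains "//".
  paste0(branch, "/", sub("^/+", "", path))
}

# Arrow: a single path of the form "<repository>/<branch>/<path>".
lakefs_arrow_path <- function(repo, branch, path) {
  paste0(repo, "/", lakefs_object_key(branch, path))
}

lakefs_object_key("main", "lakes.parquet")
lakefs_arrow_path("quickstart", "main", "lakes.parquet")
```

The first result is suitable for the object argument of an aws.s3 function, and the second for lakefs$path() on the S3FileSystem object created earlier.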
Performing lakeFS Operations using the lakeFS API from R
As well as reading and writing data, you will also want to carry out lakeFS operations from R including creating branches, committing data, and more.
To do this, call the lakeFS API using the httr library. You should refer to the API documentation for full details of the endpoints and their behaviour. Below are a few examples to illustrate the usage.
Check the lakeFS Server Version
This is a useful API call to establish connectivity and test authentication.
library(httr)
lakefs_api_url <- "lakefs.mycorp.com:8000/api/v1"
lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
r=GET(url=paste0(lakefs_api_url, "/config/version"),
authenticate(lakefsAccessKey, lakefsSecretKey))
The returned object r can be inspected to determine the outcome of the operation by comparing it to the status codes specified in the API. Here is some example R code to demonstrate the idea:
if (r$status_code == 200) {
print(paste0("✅lakeFS credentials and connectivity verified. ℹ️lakeFS version ",content(r)$version))
} else {
print("🛑 failed to get lakeFS version")
print(content(r)$message)
}
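The same check applies to the other API calls on this page, so it can be factored into a small function. This is a sketch only: is_success and describe_result are hypothetical names, and the code assumes that any 2xx status counts as success. The API documentation lists the exact code each endpoint returns.

```r
# Hypothetical helper: TRUE when an HTTP status code is in the 2xx range.
# Assumption: every 2xx code counts as success; check the API documentation
# for the exact code each endpoint returns.
is_success <- function(status_code) {
  status_code >= 200 & status_code < 300
}

# Hypothetical helper: turn a status code and an optional error message
# into a one-line summary that can be printed or logged.
describe_result <- function(operation, status_code, message = NULL) {
  if (is_success(status_code)) {
    paste0(operation, ": OK (HTTP ", status_code, ")")
  } else {
    paste0(operation, ": failed (HTTP ", status_code, ") ", message)
  }
}

describe_result("get version", 200)
describe_result("get version", 401, "error authenticating request")
```

With an httr response r, the call would look like describe_result("get version", r$status_code, content(r)$message).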
Create a Repository
library(httr)
lakefs_api_url <- "lakefs.mycorp.com:8000/api/v1"
lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
repo_name <- "my_new_repo"
# Define the payload
body=list(name=repo_name,
storage_namespace="s3://example-bucket/foo")
# Call the API
r=POST(url=paste0(lakefs_api_url, "/repositories"),
authenticate(lakefsAccessKey, lakefsSecretKey),
body=body, encode="json")
Commit Data
library(httr)
lakefs_api_url <- "lakefs.mycorp.com:8000/api/v1"
lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
repo_name <- "my_new_repo"
branch <- "example"
# Define the payload
body=list(message="add some data and charts",
metadata=list(
client="httr",
author="rmoff"))
# Call the API
r=POST(url=paste0(lakefs_api_url, "/repositories/", repo_name, "/branches/", branch, "/commits"),
authenticate(lakefsAccessKey, lakefsSecretKey),
body=body, encode="json")
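Each example above builds its URL with paste0. As the number of calls grows, a small helper keeps the path construction in one place. The following is a minimal sketch; lakefs_endpoint is a hypothetical name, and it does not URL-encode its segments, so it assumes repository and branch names contain no characters that need escaping.

```r
# Hypothetical helper: join a base API URL and path segments with "/".
# Assumption: segments need no URL-encoding.
lakefs_endpoint <- function(base_url, ...) {
  segments <- c(...)
  # Remove trailing slashes from the base so the result never contains "//".
  paste0(sub("/+$", "", base_url), "/", paste(segments, collapse = "/"))
}

lakefs_api_url <- "lakefs.mycorp.com:8000/api/v1"

# The three endpoints used in the examples above.
lakefs_endpoint(lakefs_api_url, "config", "version")
lakefs_endpoint(lakefs_api_url, "repositories")
lakefs_endpoint(lakefs_api_url, "repositories", "my_new_repo",
                "branches", "example", "commits")
```

The returned string can be passed as the url argument of GET or POST in place of the paste0 expressions.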