Exporting Data
The export operation copies all data from a given lakeFS commit to a designated object store location.
For instance, the contents of lakefs://example/main might be exported to s3://company-bucket/example/latest. Clients entirely unaware of lakeFS could use that base URL to access the latest files on main. Clients aware of lakeFS can continue to use the lakeFS S3 endpoint to access repository files at s3://example/main, as well as other versions and uncommitted versions.
Possible use-cases:
- External consumers of data don’t have access to your lakeFS installation.
- Some data pipelines in the organization are not fully migrated to lakeFS.
- You want to experiment with lakeFS as a side-by-side installation first.
- You want to create copies of your data lake in other regions (taking read pricing into account).
How to use
Using spark-submit
You can run the export Main class in three different modes:
- Export all objects from branch example-branch on the example-repo repository to the S3 location s3://example-bucket/prefix/:
  .... example-repo s3://example-bucket/prefix/ --branch=example-branch
- Export all objects from commit c805e49bafb841a0875f49cd555b397340bbd9b8 on the example-repo repository to the S3 location s3://example-bucket/prefix/:
  .... example-repo s3://example-bucket/prefix/ --commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
- Export only the diff between branch example-branch and commit c805e49bafb841a0875f49cd555b397340bbd9b8 on the example-repo repository to the S3 location s3://example-bucket/prefix/:
  .... example-repo s3://example-bucket/prefix/ --branch=example-branch --prev_commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
The complete spark-submit command would look like:
spark-submit --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
--conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
--conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_ACCESS_KEY> \
--packages io.lakefs:lakefs-spark-client-301_2.12:0.1.0 \
--class io.treeverse.clients.Main export-app example-repo s3://example-bucket/prefix \
--branch=example-branch
The command assumes the Spark cluster has permissions to write to s3://example-bucket/prefix. Otherwise, add spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key with the proper credentials.
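To make the relationship between the three modes and the optional S3A credentials concrete, the following is a minimal sketch that assembles the spark-submit argument list in plain Scala. The flags, class name, and package coordinates are copied from the text above; the `ExportMode` case classes and the `ExportCommand` helper are hypothetical names introduced only for illustration and are not part of the lakeFS client.

```scala
// Hypothetical model of the three export modes described above.
sealed trait ExportMode
final case class FromBranch(branch: String) extends ExportMode
final case class FromCommit(commitId: String) extends ExportMode
final case class DiffFrom(branch: String, prevCommitId: String) extends ExportMode

// Hypothetical helper; it only builds an argument list and runs nothing.
object ExportCommand {
  // Flags that select the export mode.
  def modeFlags(mode: ExportMode): Seq[String] = mode match {
    case FromBranch(b)  => Seq(s"--branch=$b")
    case FromCommit(c)  => Seq(s"--commit_id=$c")
    case DiffFrom(b, c) => Seq(s"--branch=$b", s"--prev_commit_id=$c")
  }

  // S3A credentials are only added when the cluster cannot already
  // write to the destination.
  def s3aConf(creds: Option[(String, String)]): Seq[String] = creds match {
    case Some((accessKey, secretKey)) =>
      Seq(
        "--conf", s"spark.hadoop.fs.s3a.access.key=$accessKey",
        "--conf", s"spark.hadoop.fs.s3a.secret.key=$secretKey"
      )
    case None => Seq.empty
  }

  def build(
      endpoint: String,
      accessKey: String,
      secretKey: String,
      repo: String,
      destination: String,
      mode: ExportMode,
      s3Creds: Option[(String, String)] = None
  ): Seq[String] =
    Seq(
      "spark-submit",
      "--conf", s"spark.hadoop.lakefs.api.url=$endpoint",
      "--conf", s"spark.hadoop.lakefs.api.access_key=$accessKey",
      "--conf", s"spark.hadoop.lakefs.api.secret_key=$secretKey"
    ) ++ s3aConf(s3Creds) ++ Seq(
      "--packages", "io.lakefs:lakefs-spark-client-301_2.12:0.1.0",
      "--class", "io.treeverse.clients.Main",
      "export-app", repo, destination
    ) ++ modeFlags(mode)
}
```

Calling `ExportCommand.build(...)` with `DiffFrom("example-branch", "<commit>")` yields the same shape as the diff-mode command above, with the two S3A `--conf` pairs present only when credentials are supplied.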
Using custom code (notebook/spark)
Set up the lakeFS Spark metadata client with the endpoint and credentials as instructed on the previous page.
The client exposes the Exporter object with three export options:
- Export all objects at the HEAD of a given branch. This does not include files that were added to that branch but not committed:
  exportAllFromBranch(branch: String)
- Export all objects from a commit:
  exportAllFromCommit(commitID: String)
- Export just the diff between a commit and the HEAD of a branch. This is the ideal option for continuous exports of a branch, as it changes only the files that have changed since the previous commit:
  exportFrom(branch: String, prevCommitID: String)
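The choice between a full export and a diff export for a continuously exported branch can be sketched as follows. This is an illustration under stated assumptions: `ExportOps` is a hypothetical stand-in trait that mirrors the three method signatures listed above (the real Exporter class is not claimed to implement it), and `RecordingOps` is a stub that only records calls. How the last exported commit is tracked is left to the caller.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in mirroring the three Exporter methods listed above.
trait ExportOps {
  def exportAllFromBranch(branch: String): Unit
  def exportAllFromCommit(commitID: String): Unit
  def exportFrom(branch: String, prevCommitID: String): Unit
}

object ContinuousExport {
  // First run: export everything at the branch HEAD.
  // Later runs: export only the diff since the last exported commit.
  def run(ops: ExportOps, branch: String, lastExportedCommit: Option[String]): Unit =
    lastExportedCommit match {
      case Some(prev) => ops.exportFrom(branch, prev)
      case None       => ops.exportAllFromBranch(branch)
    }
}

// Stub that records which method was called, for demonstration only.
class RecordingOps extends ExportOps {
  val calls: ListBuffer[String] = ListBuffer.empty[String]
  def exportAllFromBranch(branch: String): Unit = { calls += s"all:$branch" }
  def exportAllFromCommit(commitID: String): Unit = { calls += s"commit:$commitID" }
  def exportFrom(branch: String, prevCommitID: String): Unit = {
    calls += s"diff:$branch:$prevCommitID"
  }
}
```

In a notebook, a thin adapter that forwards each `ExportOps` method to the configured Exporter instance would let the same `ContinuousExport.run` logic drive real exports.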
Success/Failure Indications
When the Spark export operation ends, an additional status file is added to the root object storage destination.
If all files were exported successfully, the file path has the form EXPORT_<commitID>_<ISO-8601-time-UTC>_SUCCESS.
For failures, the form is EXPORT_<commitID>_<ISO-8601-time-UTC>_FAILURE, and the file includes a log of the failed file operations.
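A downstream job that needs to react to the outcome could classify the status file by name. The sketch below parses the documented EXPORT_<commitID>_<ISO-8601-time-UTC>_SUCCESS / _FAILURE form. It assumes the commit ID is a lowercase hexadecimal string and treats the timestamp as an opaque string; `StatusFile`, `ExportSucceeded`, and `ExportFailed` are hypothetical names, not part of the lakeFS client.

```scala
// Hypothetical result types for illustration.
sealed trait ExportStatus
case object ExportSucceeded extends ExportStatus
case object ExportFailed extends ExportStatus

final case class StatusFile(commitId: String, timestamp: String, status: ExportStatus)

object StatusFile {
  // Assumption: commit IDs are lowercase hex. The timestamp is kept as-is
  // rather than parsed, since its exact ISO-8601 variant is not specified here.
  private val Pattern = """EXPORT_([0-9a-f]+)_(.+)_(SUCCESS|FAILURE)""".r

  // Returns None for any object name that is not an export status file.
  def parse(name: String): Option[StatusFile] = name match {
    case Pattern(commit, ts, "SUCCESS") => Some(StatusFile(commit, ts, ExportSucceeded))
    case Pattern(commit, ts, "FAILURE") => Some(StatusFile(commit, ts, ExportFailed))
    case _                              => None
  }
}
```

The input is the object name relative to the export root, so a caller listing the destination would strip the root prefix before calling `StatusFile.parse`.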
Export Rounds (Spark success files)
Some files should be exported before others. For example, a Spark _SUCCESS file exported before other files under the same prefix might send the wrong indication.
The export operation may contain several rounds within the same export. A failing round stops the export of all files in the following rounds.
By default, lakeFS uses the SparkFilter and has 2 rounds for each export. The first round exports all files other than Spark _SUCCESS files. The second round exports all Spark _SUCCESS files.
You may override the default behaviour by passing a custom filter to the Exporter.
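The two-round ordering can be illustrated with a small, self-contained sketch. This is not the SparkFilter implementation; `ExportRounds` and its methods are hypothetical names, and `copy` stands in for whatever performs the actual object copy and reports success.

```scala
// Hypothetical illustration of round-based ordering; not the lakeFS SparkFilter.
object ExportRounds {
  // True for Spark's _SUCCESS marker, at the root or under any prefix.
  def isSparkSuccessFile(key: String): Boolean =
    key == "_SUCCESS" || key.endsWith("/_SUCCESS")

  // Round 1: every file that is not a _SUCCESS marker.
  // Round 2: the _SUCCESS markers.
  def rounds(keys: Seq[String]): Seq[Seq[String]] = {
    val (markers, data) = keys.partition(isSparkSuccessFile)
    Seq(data, markers)
  }

  // Runs the rounds in order. Every file in a round is attempted, but a
  // round with any failure prevents later rounds from starting.
  // Returns the number of rounds that completed without failures.
  def exportInRounds(keys: Seq[String])(copy: String => Boolean): Int =
    rounds(keys).takeWhile(round => round.map(copy).forall(identity)).size
}
```

With this ordering, a consumer that waits for `_SUCCESS` under a prefix only sees the marker after the data files of the first round were all copied.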
Example
- First configure the Exporter instance:
  import io.treeverse.clients.{ApiClient, Exporter}
  import org.apache.spark.sql.SparkSession

  val endpoint = "http://<LAKEFS_ENDPOINT>/api/v1"
  val accessKey = "<LAKEFS_ACCESS_KEY_ID>"
  val secretKey = "<LAKEFS_SECRET_ACCESS_KEY>"

  val repo = "example-repo"

  val spark = SparkSession.builder().appName("I can export").master("local").getOrCreate()
  val sc = spark.sparkContext
  sc.hadoopConfiguration.set("lakefs.api.url", endpoint)
  sc.hadoopConfiguration.set("lakefs.api.access_key", accessKey)
  sc.hadoopConfiguration.set("lakefs.api.secret_key", secretKey)

  // Add any required spark context configuration for s3

  val rootLocation = "s3://company-bucket/example/latest"

  val apiClient = new ApiClient(endpoint, accessKey, secretKey)
  val exporter = new Exporter(spark, apiClient, repo, rootLocation)
- Now you can export all objects from the main branch to s3://company-bucket/example/latest:
  val branch = "main"
  exporter.exportAllFromBranch(branch)
- Assuming a previous successful export on commit f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7, you can alternatively export just the difference between the main branch and the commit:
  val branch = "main"
  val commit = "f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7"
  exporter.exportFrom(branch, commit)