
Exporting Data

The export operation copies all data from a given lakeFS commit to a designated object store location.

For instance, the contents of lakefs://example/main might be exported to s3://company-bucket/example/latest. Clients entirely unaware of lakeFS could use that base URL to access the latest files of main. Clients aware of lakeFS can continue to use the lakeFS S3 endpoint to access repository files on s3://example/main, as well as other versions and uncommitted versions.
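For illustration, here is a minimal Spark sketch of the two access paths; the table sub-path and the S3A configuration are assumptions for the example, not part of the export itself:

    // A rough sketch, assuming an existing SparkSession `spark`, S3A credentials,
    // and a hypothetical Parquet table under path/to/table.

    // Client unaware of lakeFS: read the exported copy directly from S3.
    val exported = spark.read.parquet("s3a://company-bucket/example/latest/path/to/table")

    // lakeFS-aware client: read the same data through the lakeFS S3 gateway
    // (requires the gateway configured as the S3A endpoint for this bucket),
    // addressing the repository ("example") and branch ("main") instead.
    val viaLakefs = spark.read.parquet("s3a://example/main/path/to/table")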

Possible use-cases:

  1. External consumers of data don’t have access to your lakeFS installation.
  2. Some data pipelines in the organization are not fully migrated to lakeFS.
  3. You want to experiment with lakeFS as a side-by-side installation first.
  4. Create copies of your data lake in other regions (taking into account read pricing).

How to use

Using spark-submit

You can run the export main program in 3 different modes:

  1. Export all objects from branch example-branch in the example-repo repository to the S3 location s3://example-bucket/prefix/:

    .... example-repo s3://example-bucket/prefix/ --branch=example-branch
    
  2. Export all objects from commit c805e49bafb841a0875f49cd555b397340bbd9b8 in the example-repo repository to the S3 location s3://example-bucket/prefix/:

    .... example-repo s3://example-bucket/prefix/ --commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
    
  3. Export only the diff between branch example-branch and commit c805e49bafb841a0875f49cd555b397340bbd9b8 in the example-repo repository to the S3 location s3://example-bucket/prefix/:

    .... example-repo s3://example-bucket/prefix/ --branch=example-branch --prev_commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
    

The complete spark-submit command would look like:

spark-submit --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
  --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_ACCESS_KEY> \
  --packages io.lakefs:lakefs-spark-client-301_2.12:0.1.0 \
  --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/prefix \
  --branch=example-branch

The command assumes the Spark cluster has permissions to write to s3://example-bucket/prefix. Otherwise, add spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key with the proper credentials.
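For instance, a sketch of the same command with the S3A credentials passed explicitly (the <S3_ACCESS_KEY_ID> and <S3_SECRET_ACCESS_KEY> values are placeholders; omit these lines if the cluster already has write access to the bucket):

    spark-submit --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
      --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
      --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_ACCESS_KEY> \
      --conf spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY_ID> \
      --conf spark.hadoop.fs.s3a.secret.key=<S3_SECRET_ACCESS_KEY> \
      --packages io.lakefs:lakefs-spark-client-301_2.12:0.1.0 \
      --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/prefix \
      --branch=example-branch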

Using custom code (notebook/spark)

Set up the lakeFS Spark metadata client with the endpoint and credentials, as instructed on the previous page.

The client exposes the Exporter object with 3 export options:

  1. Export all objects at the HEAD of a given branch. This does not include files that were added to the branch but not yet committed:

    exportAllFromBranch(branch: String)
    
  2. Export all objects from a commit:

    exportAllFromCommit(commitID: String)
    
  3. Export only the diff between a commit and the HEAD of a branch. This is the ideal option for continuous exports of a branch, as it copies only the files that have changed since the previous commit:

    exportFrom(branch: String, prevCommitID: String)
    

Success/Failure Indications

When the Spark export operation ends, an additional status file is added to the root of the object storage destination. If all files were exported successfully, the file path will be of the form EXPORT_<commitID>_<ISO-8601-time-UTC>_SUCCESS. On failure, the path will be of the form EXPORT_<commitID>_<ISO-8601-time-UTC>_FAILURE, and the file will include a log of the failed file operations.
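For example, a successful export might leave a marker object like the following (hypothetical commit ID and timestamp; the exact timestamp rendering may differ):

    EXPORT_c805e49bafb841a0875f49cd555b397340bbd9b8_2021-03-01T12:00:00Z_SUCCESS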

Export Rounds (Spark success files)

Some files should be exported before others, e.g. a Spark _SUCCESS file exported before the other files under the same prefix might give the wrong indication that the data there is complete.

The export operation may therefore consist of several rounds within the same export. A failing round stops the export of all files in the following rounds.

By default, lakeFS will use the SparkFilter and split each export into 2 rounds: the first round exports all files except Spark _SUCCESS files, and the second round exports the _SUCCESS files. You may override this default behaviour by passing a custom filter to the Exporter.
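As an illustration only, the two-round split could be expressed roughly as follows; this is a standalone sketch of the logic, not the client's actual filter interface, so consult the lakeFS Spark client source for the real way to plug in a custom filter:

    // Standalone sketch of the default two-round assignment (not the client's API).
    // Round 1: every object except Spark _SUCCESS markers.
    // Round 2: the _SUCCESS markers, exported only after the data they describe.
    def roundForKey(key: String): Int =
      if (key == "_SUCCESS" || key.endsWith("/_SUCCESS")) 2 else 1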

Example

  1. First configure the Exporter instance:

    import io.treeverse.clients.{ApiClient, Exporter}
    import org.apache.spark.sql.SparkSession
    
    val endpoint = "http://<LAKEFS_ENDPOINT>/api/v1"
    val accessKey = "<LAKEFS_ACCESS_KEY_ID>"
    val secretKey = "<LAKEFS_SECRET_ACCESS_KEY>"
    
    val repo = "example-repo"
    
    val spark = SparkSession.builder().appName("I can export").master("local").getOrCreate()
    val sc = spark.sparkContext
    sc.hadoopConfiguration.set("lakefs.api.url", endpoint)
    sc.hadoopConfiguration.set("lakefs.api.access_key", accessKey)
    sc.hadoopConfiguration.set("lakefs.api.secret_key", secretKey)
    
    // Add any required spark context configuration for s3
    val rootLocation = "s3://company-bucket/example/latest"
    
    val apiClient = new ApiClient(endpoint, accessKey, secretKey)
    val exporter = new Exporter(spark, apiClient, repo, rootLocation)
    
  2. Now you can export all objects from the main branch to s3://company-bucket/example/latest:

    val branch = "main"
    exporter.exportAllFromBranch(branch)
    
  3. Assuming a previous successful export on commit f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7, you can alternatively export just the difference between the main branch and that commit:

    val branch = "main"
    val commit = "f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7"
    exporter.exportFrom(branch, commit)
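    
  4. Similarly, to export all objects from a specific commit (a sketch reusing the commit ID from the previous step):

    val commit = "f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7"
    exporter.exportAllFromCommit(commit)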