Copying Data to/from lakeFS with DistCp

Apache Hadoop DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. Because lakeFS exposes an S3-compatible API, DistCp can read from and write to lakeFS repositories through Hadoop's S3A connector.

Note

In the following examples, we set AWS credentials on the command line for clarity. In production, set these properties using one of Hadoop’s standard ways of authenticating with S3.
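One of those standard ways is a Hadoop credential provider keystore. The sketch below is an illustration under assumptions, not a recommended setup: the keystore path is hypothetical, and the `run` helper is not part of Hadoop. It only prints each command unless `DRY_RUN=0` is set, so the script can be inspected before anything runs.

```shell
#!/bin/sh
# Hypothetical keystore location; use a path your cluster can read.
PROVIDER="jceks://hdfs@nn1.example.com:9001/user/backup/lakefs.jceks"

# Helper (an assumption of this sketch): print commands by default,
# execute them only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Store the lakeFS key pair once; without -value, hadoop prompts for each secret.
run hadoop credential create fs.s3a.access.key -provider "$PROVIDER"
run hadoop credential create fs.s3a.secret.key -provider "$PROVIDER"

# Point DistCp at the keystore instead of passing keys on the command line.
run hadoop distcp \
  -Dhadoop.security.credential.provider.path="$PROVIDER" \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.endpoint="https://lakefs.example.com" \
  "s3a://example-repo-1/main/example-file.parquet" \
  "s3a://example-repo-2/main/example-file.parquet"
```

With the keys in a keystore, they no longer appear in shell history or in the process list of the submitting host.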

Copying from lakeFS to lakeFS

You can use DistCp to copy between two different lakeFS repositories. Replace the example access key pair in the command with your lakeFS access key pair:

hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.endpoint="https://lakefs.example.com" \
  "s3a://example-repo-1/main/example-file.parquet" \
  "s3a://example-repo-2/main/example-file.parquet"
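The paths in the command above follow the pattern `s3a://<repository>/<branch>/<path>`: the repository takes the place of the S3 bucket, and the branch is the first path component. As a small sketch, the hypothetical helper `lakefs_uri` (not part of lakeFS or Hadoop) composes such URIs from their parts:

```shell
#!/bin/sh
# Build an S3A URI for an object in a lakeFS repository.
# Arguments: repository, branch, path inside the repository.
lakefs_uri() {
  printf 's3a://%s/%s/%s\n' "$1" "$2" "$3"
}

# Source and destination matching the DistCp example.
SRC=$(lakefs_uri example-repo-1 main example-file.parquet)
DST=$(lakefs_uri example-repo-2 main example-file.parquet)
echo "$SRC"
echo "$DST"
```

The resulting `$SRC` and `$DST` values can be passed as the last two arguments of `hadoop distcp`.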

val workDir = s"s3a://${repo}/${branch}/collection/shows"
val dataPath = s"$workDir/title.basics.parquet"

Copying between S3 and lakeFS

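To copy between an S3 bucket and a lakeFS repository, use Hadoop’s per-bucket configuration, which lets each bucket name (here, the lakeFS repository and the S3 bucket) have its own credentials and endpoint. In the following examples, replace the first access key pair (set for example-repo) with your lakeFS key pair, and the second one (set for example-bucket) with your AWS IAM key pair: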

From S3 to lakeFS

hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.bucket.example-repo.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.bucket.example-repo.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.bucket.example-repo.endpoint="https://lakefs.example.com" \
  -Dfs.s3a.bucket.example-bucket.access.key="AKIAIOSFODNN3EXAMPLE" \
  -Dfs.s3a.bucket.example-bucket.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
  "s3a://example-bucket/example-file.parquet" \
  "s3a://example-repo/main/example-file.parquet"

From lakeFS to S3

hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.bucket.example-repo.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.bucket.example-repo.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.bucket.example-repo.endpoint="https://lakefs.example.com" \
  -Dfs.s3a.bucket.example-bucket.access.key="AKIAIOSFODNN3EXAMPLE" \
  -Dfs.s3a.bucket.example-bucket.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
  "s3a://example-repo/main/myfile" \
  "s3a://example-bucket/myfile"
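Both directions use the same options and differ only in the order of the source and destination paths. The sketch below wraps them in one function. It rests on assumptions: the environment variable names (`LAKEFS_ACCESS_KEY`, `LAKEFS_SECRET_KEY`, `AWS_ACCESS_KEY`, `AWS_SECRET_KEY`, `LAKEFS_ENDPOINT`) and the `run` and `copy` helpers are hypothetical, and commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
REPO="example-repo"       # lakeFS repository, addressed like an S3A bucket
BUCKET="example-bucket"   # S3 bucket

# Helper (an assumption of this sketch): print commands by default,
# execute them only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy from $1 to $2, with per-bucket settings for the repository and the bucket.
# Fails early if one of the (hypothetical) key variables is unset.
copy() {
  run hadoop distcp \
    -Dfs.s3a.path.style.access=true \
    "-Dfs.s3a.bucket.${REPO}.access.key=${LAKEFS_ACCESS_KEY:?not set}" \
    "-Dfs.s3a.bucket.${REPO}.secret.key=${LAKEFS_SECRET_KEY:?not set}" \
    "-Dfs.s3a.bucket.${REPO}.endpoint=${LAKEFS_ENDPOINT:-https://lakefs.example.com}" \
    "-Dfs.s3a.bucket.${BUCKET}.access.key=${AWS_ACCESS_KEY:?not set}" \
    "-Dfs.s3a.bucket.${BUCKET}.secret.key=${AWS_SECRET_KEY:?not set}" \
    "$1" "$2"
}
```

For example, `copy "s3a://example-bucket/myfile" "s3a://example-repo/main/myfile"` copies from S3 to lakeFS, and swapping the two arguments copies in the other direction.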