Copying Data to/from lakeFS with DistCp
Apache Hadoop DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. You can easily use it with your lakeFS repositories.
Note
In the following examples, we set AWS credentials on the command line for clarity. In production, you should set these properties using one of Hadoop’s standard ways of authenticating with S3.
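One of those standard ways is a Hadoop credential provider, which keeps the keys in an encrypted keystore instead of on the command line. The following is a minimal sketch, not a definitive setup: the keystore location in `PROVIDER` and the function names are hypothetical placeholders, and it assumes the keystore is readable by the nodes that run the copy.

```shell
# Sketch: keep lakeFS credentials in a Hadoop credential provider (JCEKS)
# rather than passing them with -D on the command line.
# PROVIDER is a hypothetical keystore location; adjust it for your cluster.
PROVIDER="jceks://hdfs@nn1.example.com:9001/user/backup/lakefs.jceks"

store_credentials() {
  # Without -value, each command prompts for the secret, so the key does
  # not end up in shell history or in the process list.
  hadoop credential create fs.s3a.access.key -provider "$PROVIDER"
  hadoop credential create fs.s3a.secret.key -provider "$PROVIDER"
}

copy_with_provider() {
  # $1 is the source URI, $2 the destination URI.
  hadoop distcp \
    -Dhadoop.security.credential.provider.path="$PROVIDER" \
    -Dfs.s3a.path.style.access=true \
    -Dfs.s3a.endpoint="https://lakefs.example.com" \
    "$1" "$2"
}
```

Run `store_credentials` once, then call `copy_with_provider` with a source and a destination URI. The endpoint and path-style options are not secrets, so they stay on the command line here.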
Copying from lakeFS to lakeFS
You can use DistCp to copy between two different lakeFS repositories. Replace the access key pair with your lakeFS access key pair:
hadoop distcp \
-Dfs.s3a.path.style.access=true \
-Dfs.s3a.access.key="AKIAIOSFODNN7EXAMPLE" \
-Dfs.s3a.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
-Dfs.s3a.endpoint="https://lakefs.example.com" \
"s3a://example-repo-1/main/example-file.parquet" \
"s3a://example-repo-2/main/example-file.parquet"
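In the command above, each path has the form `s3a://<repository>/<branch>/<path>`. As a rough sketch, a small helper can build these URIs so that the same path is copied between two repositories or branches. The function names and the `LAKEFS_ENDPOINT` variable are hypothetical, and the sketch assumes that credentials are already configured outside the command line.

```shell
# Build an s3a URI for an object in a lakeFS repository.
# Arguments: repository, branch, path inside the repository.
lakefs_uri() {
  # "${3#/}" drops one leading slash so the URI has no empty path segment.
  printf 's3a://%s/%s/%s\n' "$1" "$2" "${3#/}"
}

# Copy the same path from one repository and branch to another.
# Arguments: source repo, source branch, destination repo,
# destination branch, path.
copy_between_repos() {
  hadoop distcp \
    -Dfs.s3a.path.style.access=true \
    -Dfs.s3a.endpoint="${LAKEFS_ENDPOINT:-https://lakefs.example.com}" \
    "$(lakefs_uri "$1" "$2" "$5")" \
    "$(lakefs_uri "$3" "$4" "$5")"
}
```

For example, `copy_between_repos example-repo-1 main example-repo-2 main example-file.parquet` runs the same copy as the command above, minus the credential options.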
Copying between S3 and lakeFS
To copy between an S3 bucket and a lakeFS repository, use Hadoop’s per-bucket configuration. In the following examples, replace the first access key pair with your lakeFS key pair and the second with your AWS IAM key pair:
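Per-bucket properties follow the pattern `fs.s3a.bucket.<name>.<option>`, where `<name>` is either the lakeFS repository or the S3 bucket. Only the lakeFS repository needs an endpoint override. The sketch below generates these options from environment variables. The helper names and the `LAKEFS_ACCESS_KEY` and `LAKEFS_SECRET_KEY` variables are hypothetical, and the keys still reach the `hadoop` command line, so treat it as an illustration of the pattern rather than a production setup.

```shell
# Print the per-bucket S3A options for one bucket or repository, one per line.
# Arguments: name, access key, secret key, optional endpoint.
s3a_bucket_opts() {
  printf '%s\n' "-Dfs.s3a.bucket.$1.access.key=$2"
  printf '%s\n' "-Dfs.s3a.bucket.$1.secret.key=$3"
  if [ -n "${4:-}" ]; then
    printf '%s\n' "-Dfs.s3a.bucket.$1.endpoint=$4"
  fi
}

# Copy between the S3 bucket "example-bucket" and the lakeFS repository
# "example-repo". The order of $1 (source) and $2 (destination) sets the
# direction; the options are the same either way.
distcp_s3_lakefs() {
  # The command substitutions are unquoted on purpose so that each printed
  # option becomes its own argument. This assumes the keys contain no
  # whitespace or glob characters.
  hadoop distcp \
    -Dfs.s3a.path.style.access=true \
    $(s3a_bucket_opts example-repo "$LAKEFS_ACCESS_KEY" "$LAKEFS_SECRET_KEY" "https://lakefs.example.com") \
    $(s3a_bucket_opts example-bucket "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY") \
    "$1" "$2"
}
```

Calling `distcp_s3_lakefs "s3a://example-bucket/example-file.parquet" "s3a://example-repo/main/example-file.parquet"` copies from S3 to lakeFS. Swapping the two arguments copies in the other direction.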
From S3 to lakeFS
hadoop distcp \
-Dfs.s3a.path.style.access=true \
-Dfs.s3a.bucket.example-repo.access.key="AKIAIOSFODNN7EXAMPLE" \
-Dfs.s3a.bucket.example-repo.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
-Dfs.s3a.bucket.example-repo.endpoint="https://lakefs.example.com" \
-Dfs.s3a.bucket.example-bucket.access.key="AKIAIOSFODNN3EXAMPLE" \
-Dfs.s3a.bucket.example-bucket.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
"s3a://example-bucket/example-file.parquet" \
"s3a://example-repo/main/example-file.parquet"
From lakeFS to S3
hadoop distcp \
-Dfs.s3a.path.style.access=true \
-Dfs.s3a.bucket.example-repo.access.key="AKIAIOSFODNN7EXAMPLE" \
-Dfs.s3a.bucket.example-repo.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
-Dfs.s3a.bucket.example-repo.endpoint="https://lakefs.example.com" \
-Dfs.s3a.bucket.example-bucket.access.key="AKIAIOSFODNN3EXAMPLE" \
-Dfs.s3a.bucket.example-bucket.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
"s3a://example-repo/main/myfile" \
"s3a://example-bucket/myfile"