Using lakeFS with EMR

Amazon EMR is a managed cluster platform that simplifies running Big Data frameworks, such as Apache Hadoop and Apache Spark.

Configuration

To configure Spark on EMR to work with lakeFS, you need to set the lakeFS credentials and endpoint in the appropriate fields. The exact configuration keys depend on the application running on EMR, but they follow this pattern:

lakeFS endpoint: *.fs.s3a.endpoint

lakeFS access key: *.fs.s3a.access.key

lakeFS secret key: *.fs.s3a.secret.key

EMR encourages the use of s3:// with Spark, since that scheme routes through EMR's proprietary driver. For this guide to work, you must use s3a:// instead.

Spark job reads and writes will then be directed to the lakeFS instance through its S3 gateway.
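To illustrate the addressing scheme the S3 gateway expects, here is a small helper (hypothetical, not part of lakeFS) that builds an s3a:// object path from a repository, branch, and object key:

```python
# Hypothetical helper illustrating the path layout the lakeFS S3 gateway
# expects: s3a://<repository>/<branch>/<object key>.
def lakefs_s3a_uri(repository: str, branch: str, key: str) -> str:
    if not repository or not branch:
        raise ValueError("repository and branch are required")
    return f"s3a://{repository}/{branch}/{key.lstrip('/')}"
```

For example, `lakefs_s3a_uri("my-repo", "main", "data/events.parquet")` yields `s3a://my-repo/main/data/events.parquet`, which Spark will read from and write to via the gateway.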

You can choose from two options for configuring an EMR cluster to work with lakeFS:

  1. At cluster creation - all steps use the cluster configuration, and no step-specific configuration is needed.
  2. On each step - the cluster is created with the default S3 configuration, and each step that uses lakeFS passes the appropriate configuration parameters.

Configuration on cluster creation

Use the configuration below when creating the cluster. You may delete any application configuration that is not relevant to your use case.

[{
   "Classification": "presto-connector-hive",
   "Properties": {
      "hive.s3.aws-access-key": "AKIAIOSFODNN7EXAMPLE",
      "hive.s3.aws-secret-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "hive.s3.endpoint": "https://lakefs.example.com",
      "hive.s3.path-style-access": "true",
      "hive.s3-file-system-type": "PRESTO"
   }
},
   {
      "Classification": "hive-site",
      "Properties": {
         "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
         "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
         "fs.s3.endpoint": "https://lakefs.example.com",
         "fs.s3.path.style.access": "true",
         "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
         "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
         "fs.s3a.endpoint": "https://lakefs.example.com",
         "fs.s3a.path.style.access": "true"
      }
   },
   {
      "Classification": "hdfs-site",
      "Properties": {
         "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
         "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
         "fs.s3.endpoint": "https://lakefs.example.com",
         "fs.s3.path.style.access": "true",
         "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
         "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
         "fs.s3a.endpoint": "https://lakefs.example.com",
         "fs.s3a.path.style.access": "true"
      }
   },
   {
      "Classification": "core-site",
      "Properties": {
         "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
         "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
         "fs.s3.endpoint": "https://lakefs.example.com",
         "fs.s3.path.style.access": "true",
         "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
         "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
         "fs.s3a.endpoint": "https://lakefs.example.com",
         "fs.s3a.path.style.access": "true"
      }
   },
   {
      "Classification": "emrfs-site",
      "Properties": {
         "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
         "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
         "fs.s3.endpoint": "https://lakefs.example.com",
         "fs.s3.path.style.access": "true",
         "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
         "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
         "fs.s3a.endpoint": "https://lakefs.example.com",
         "fs.s3a.path.style.access": "true"
      }
   },
   {
      "Classification": "mapred-site",
      "Properties": {
         "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
         "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
         "fs.s3.endpoint": "https://lakefs.example.com",
         "fs.s3.path.style.access": "true",
         "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
         "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
         "fs.s3a.endpoint": "https://lakefs.example.com",
         "fs.s3a.path.style.access": "true"
      }
   },
   {
      "Classification": "spark-defaults",
      "Properties": {
         "spark.sql.catalogImplementation": "hive"
      }
   }
]
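Most of the classifications above repeat the same four filesystem properties. As a convenience, they can be generated from a single set of credentials with a short script; this is a sketch that reproduces the example JSON above (trim the classification list to the applications you actually run, and pass the output to `aws emr create-cluster --configurations file://...`):

```python
import json

def lakefs_emr_configurations(endpoint: str, access_key: str, secret_key: str):
    """Build the EMR configuration classifications shown above."""
    # The fs.s3 / fs.s3a properties shared by the *-site classifications.
    fs_props = {}
    for scheme in ("fs.s3", "fs.s3a"):
        fs_props[f"{scheme}.access.key"] = access_key
        fs_props[f"{scheme}.secret.key"] = secret_key
        fs_props[f"{scheme}.endpoint"] = endpoint
        fs_props[f"{scheme}.path.style.access"] = "true"

    configs = [{
        "Classification": "presto-connector-hive",
        "Properties": {
            "hive.s3.aws-access-key": access_key,
            "hive.s3.aws-secret-key": secret_key,
            "hive.s3.endpoint": endpoint,
            "hive.s3.path-style-access": "true",
            "hive.s3-file-system-type": "PRESTO",
        },
    }]
    for name in ("hive-site", "hdfs-site", "core-site", "emrfs-site", "mapred-site"):
        configs.append({"Classification": name, "Properties": dict(fs_props)})
    configs.append({
        "Classification": "spark-defaults",
        "Properties": {"spark.sql.catalogImplementation": "hive"},
    })
    return configs

if __name__ == "__main__":
    print(json.dumps(lakefs_emr_configurations(
        "https://lakefs.example.com",
        "AKIAIOSFODNN7EXAMPLE",
        "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"), indent=3))
```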

Configuration on adding a step

If a cluster was created without the above configuration, you can still use lakeFS by passing the configuration when adding a step.

For example, when creating a Spark job:

aws emr add-steps --cluster-id j-197B3AEGQ9XE4 \
  --steps="Type=Spark,Name=SparkApplication,ActionOnFailure=CONTINUE, \
  Args=[--conf,spark.hadoop.fs.s3a.access.key=AKIAIOSFODNN7EXAMPLE, \
  --conf,spark.hadoop.fs.s3a.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY, \
  --conf,spark.hadoop.fs.s3a.endpoint=https://lakefs.example.com, \
  --conf,spark.hadoop.fs.s3a.path.style.access=true, \
  s3a://<lakefs-repo>/<lakefs-branch>/path/to/jar]"
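The step's argument list can also be assembled programmatically, for example when submitting steps through the EMR API with boto3. A sketch (the helper name and the jar path are illustrative):

```python
# Sketch: build the --conf arguments used in the add-steps call above,
# followed by the application jar path.
def lakefs_spark_step_args(endpoint: str, access_key: str,
                           secret_key: str, jar_path: str) -> list:
    conf = {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args + [jar_path]
```

The returned list matches the `Args=[...]` portion of the CLI example and can be dropped into a boto3 step definition.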

The Spark context in the running job will already be initialized with the provided lakeFS configuration, so there is no need to repeat the configuration steps described in Using lakeFS with Spark.