Using lakeFS with Presto/Trino
Presto and Trino are distributed SQL query engines designed to query large data sets distributed over one or more heterogeneous data sources.
Querying data in lakeFS from Presto/Trino is similar to querying data in S3 from Presto/Trino. It is done using the Presto Hive connector or Trino Hive connector.
Credentials
In the following examples, we set the AWS credentials at runtime for clarity. In production, these properties should be set using one of Hadoop's standard ways of authenticating with S3.
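For example, one common approach (a sketch only; the provider path /etc/hadoop/lakefs.jceks is an arbitrary choice) is to store the lakeFS access key and secret key in a Hadoop credential provider rather than in plain-text properties files. The command prompts for each value when -value is not given:
hadoop credential create fs.s3a.access.key -provider jceks://file/etc/hadoop/lakefs.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://file/etc/hadoop/lakefs.jceks
The provider file is then referenced from your Hadoop configuration via the hadoop.security.credential.provider.path property, so the keys do not appear in configuration files.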
Configuration
Configure the Hive connector
Create /etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the Hive catalog, replacing example.net:9083 with the correct host and port for your Hive Metastore Thrift service:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
Add the lakeFS configuration to /etc/catalog/hive.properties using the corresponding S3 configuration properties:
hive.s3.aws-access-key=AKIAIOSFODNN7EXAMPLE
hive.s3.aws-secret-key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
hive.s3.endpoint=https://lakefs.example.com
hive.s3.path-style-access=true
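As an optional sanity check, you can query the new catalog from the Trino (or Presto) CLI; the catalog name hive below follows from the properties file name used above:
SHOW SCHEMAS FROM hive;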
Configure Hive
Presto/Trino uses the Hive Metastore Service (HMS), or a compatible implementation such as the AWS Glue Data Catalog, to write data to S3. If you are using the Hive Metastore, you will need to configure Hive as well.
In the file hive-site.xml, add the following to the configuration:
<configuration>
...
<property>
<name>fs.s3a.access.key</name>
<value>AKIAIOSFODNN7EXAMPLE</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value>https://lakefs.example.com</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
</property>
</configuration>
Examples
Here are some examples based on the Presto Hive connector and Trino Hive connector documentation.
Example with schema
Create a new schema named main that will store tables in the lakeFS repository example, on the branch main:
CREATE SCHEMA main
WITH (location = 's3a://example/main')
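Since lakeFS paths have the form s3a://&lt;repository&gt;/&lt;branch&gt;/&lt;path&gt;, pointing a schema at a different branch only requires changing the location. For example, assuming a branch named example_branch already exists in the repository:
CREATE SCHEMA example_branch
WITH (location = 's3a://example/example_branch')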
Create a new Hive table named page_views in the main schema, stored using the ORC file format, partitioned by date and country, and bucketed by user into 50 buckets (note that Hive requires the partition columns to be the last columns in the table):
CREATE TABLE main.page_views (
view_time timestamp,
user_id bigint,
page_url varchar,
ds date,
country varchar
)
WITH (
format = 'ORC',
partitioned_by = ARRAY['ds', 'country'],
bucketed_by = ARRAY['user_id'],
bucket_count = 50
)
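Once created, the table can be written to and queried like any other Hive table. A minimal illustration (the values below are made up):
INSERT INTO main.page_views
VALUES (TIMESTAMP '2021-01-01 10:00:00', 1, 'http://example.com/home', DATE '2021-01-01', 'US');

SELECT country, count(*) AS views
FROM main.page_views
WHERE ds = DATE '2021-01-01'
GROUP BY country;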
Example with external table
Create an external Hive table named request_logs
that points at existing data in lakeFS:
CREATE TABLE main.request_logs (
request_time timestamp,
url varchar,
ip varchar,
user_agent varchar
)
WITH (
format = 'TEXTFILE',
external_location = 's3a://example/main/data/logs/'
)
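The external table can then be queried in place; for example (results depend on how the underlying log files are delimited):
SELECT count(*) AS requests
FROM main.request_logs
WHERE user_agent LIKE '%Chrome%';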
Example of copying a table with metastore tools:
Deprecated Feature: Having heard the feedback from the community, we are planning to replace the manual steps below with an automated process. You can read more about it here.
Copy the created table page_views in schema main to schema example_branch, with the location s3a://example/example_branch/page_views/:
lakectl metastore copy --from-schema main --from-table page_views --to-branch example_branch
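After the copy, the table can be queried through the new schema (assuming the same hive catalog as above), for example:
SELECT * FROM example_branch.page_views LIMIT 10;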