This part contains a brief explanation about how Glue/Hive metastore work with lakeFS
Glue and Hive Metastore stores metadata related to Hive and other services (such as Spark and Trino). They contain metadata such as the location of the table, information about columns, partitions and many more.
In order to query the table
my_table, Spark will:
- Request the metadata from Hive metastore (steps 1,2)
- Use the location from the metadata to access the data in S3 (steps 3,4).
When using lakeFS, the flow stays exactly the same. Note that the location of the table
my_table now contains the branch
When creating a table in Glue/Hive metastore (using a client such as Spark, Hive, Presto), we specify the table location.
Consider the table
my_table which was created with the location
Assume we created a new branch called
main as the source branch.
The data from
s3://example/main/path/to/table is now accessible in
The metadata is not managed in lakeFS, meaning we don’t have any table pointing to
To address this, lakeFS introduces
lakectl metastore commands. The case above could be handled using the copy command: it can create a copy of
my_table with data located in
s3://example/DEV/path/to/table. Note that this is a fast, metadata-only operation.
lakectl metastore commands could run on Glue or Hive metastore.
Add the following to the lakectl configuration file (by default
metastore: type: hive hive: uri: thrift://hive-metastore:9083
metastore: type: glue glue: catalog-id: 123456789012 region: us-east-1 profile: default # optional, implies using a credentials file credentials: access_key_id: AKIAIOSFODNN7EXAMPLE secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Notice: It’s recommended to set type and catalog-id/metastore-uri in the lakectl configuration file.
For simplicity, we recommend creating a schema for each branch, this way you can use the same table name across different schemas.
after creating branch
example_branch also create a schema named
For a table named
my_table under the schema
main, create a new table by the same name under the schema
example_branch. You now have two
my_tables, one in the main schema and one, in the branch schema.
Metastore tools support three commands:
copy and diff could work both on Glue and on Hive.
create-symlink works only on Glue.
to-table are not specified, the destination branch and source table names will be used as per the suggested model.
Notice: Metastore commands can only run on tables located in lakeFS, you should not use tables that are not located in lakeFS.
copy command creates a copy of a table pointing to the defined branch.
In case the destination table already exists, the command will only merge the changes.
Suppose we created the table
inventory on branch
main on schema
CREATE EXTERNAL TABLE `inventory`( `inv_item_sk` int, `inv_warehouse_sk` int, `inv_quantity_on_hand` int) PARTITIONED BY ( `inv_date_sk` int) STORED AS ORC LOCATION 's3a://my_repo/main/path/to/table';
We create a new lakeFS branch
lakectl branch create lakefs://my_repo/example_branch --source lakefs://my_repo/main
The data from
s3://my_repo/main/path/to/table is now accessible in
In order to query the data in
we would like to create a copy of the table
inventory in schema
example_branch pointing to the new branch.
lakectl metastore copy --from-schema default --from-table inventory --to-schema example_branch --to-table inventory --to-branch example_branch
After running this command, query the table
example_branch.inventory to get the data from
After adding a partition to the branch table, we may want to copy the partition to the main table.
For example, for the new partition
2020-08-01, run the following in order to copy the partition to the main table:
lakectl metastore copy --type hive --from-schema example_branch --from-table inventory --to-schema default --to-table inventory --to-branch main -p 2020-08-01
For a table partitioned by more than one column, specify the partition flag for every column. For example for the partition
lakectl metastore copy --from-schema example_branch --from-table branch_inventory --to-schema default --to-branch main -p 2020 -p 08 -p 01
Provides a 2-way diff between two tables.
+ , removed
- and changed
~ partitions and columns.
Suppose that we made some changes on the copied table
inventory on schema
example_branch and we want to view the changes before merging back to
inventory on schema
lakectl metastore diff --type hive --address thrift://hive-metastore:9083 --from-schema example_branch --from-table branch --to-schema default --to-table inventory
The output will be something like:
Columns are identical Partitions - 2020-07-04 + 2020-07-05 + 2020-07-06 ~ 2020-07-08
Athena doesn’t support configuring the endpoint-uri. to use S3-compatible services like lakeFS. Hence, Athena can’t access lakeFS, and can only be used with AWS S3 as the storage.
In order to enable accessing partitioned data we could use the
create-symlink receives a source table, destination table and the location of the table and does two actions:
- Creates partitioned directories with symlink files in the underlying S3 bucket.
- Creates a table in Glue catalog with symlink format type and location pointing to the created symlinks.
Notice: create-symlink source table must point to a location in lakeFS.
Let’s assume we have the table
inventory in Glue.
The table is pointing to repo
main and the data is located at
We want to query the table using Athena. To do this, we run the command:
lakectl metastore create-symlink --address 123456789012 --branch main --from-schema default --from-table branch_inventory --to-schema default --to-table sym_inventory --repo example-repository --path path/to/table/in/lakeFS
We can now use Amazon Athena to query the created table