
Auditing

Info

This feature is available on lakeFS Cloud and lakeFS Enterprise

The lakeFS audit log lets you view all relevant user actions in a clear, organized table, including when each action was performed, by whom, and what was done.

This can be useful for several purposes, including:

  1. Compliance - Audit logs can be used to show what data users accessed, as well as any changes they made to user management.

  2. Troubleshooting - If something changes on your underlying object store that you weren't expecting, such as a big file suddenly breaking into thousands of smaller files, you can use the audit log to find out what action led to this change.

Setting up access to Audit Logs on AWS S3

Access to the audit logs is provided through an AWS S3 Access Point.

There are different ways to interact with an access point (see Using access points in AWS).

The initial setup:

  1. Take note of the IAM role ARN that will be used to access the data. This should be the user or role used by, for example, Athena.
  2. Reach out to customer success and provide this ARN. Once the role ARN is received, an access point will be created and you will receive the following details:
    1. S3 Bucket (e.g. arn:aws:s3:::lakefs-audit-logs-us-east-1-production)
    2. S3 URI to an access point (e.g. s3://arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>)
    3. Access Point alias. You can use this alias instead of the bucket name or Access Point ARN to access data through the Access Point. (e.g. lakefs-logs-<generated>-s3alias)
    4. Update your IAM Role policy and trust policy if required

A minimal example of an IAM policy for two lakeFS installations in two regions (us-east-1, us-west-2):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/*",
                "arn:aws:s3:::lakefs-logs-<generated>-s3alias/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/*"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                        "etl/v1/data/region=<region_b>/organization=org-<organization>/*"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/etl/v1/data/region=<region_b>/organization=org-<organization>/*",
                "arn:aws:s3:::lakefs-logs-<generated>-s3alias/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/object/etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/object/etl/v1/data/region=<region_b>/organization=org-<organization>/*"
            ]
        },
        {
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:kms:us-east-1:<treeverse-id>:key/<encryption-key-id>"
            ],
            "Effect": "Allow"
        }
    ]
}

An example trust policy that allows any principal in your account to assume the role above:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<YOUR_ACCOUNT_ID>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}

Authentication is done by assuming an IAM Role:

# Assume the role; the response contains temporary AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN credentials:
aws sts assume-role --role-arn arn:aws:iam::<your-aws-account>:role/<reader-role> --role-session-name <name>

# Verify the role was assumed
aws sts get-caller-identity

# List objects (can be used with --recursive) using the access point ARN
aws s3 ls arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/etl/v1/data/region=<region>/organization=org-<organization>/

# Download an object locally via the S3 access point alias
aws s3api get-object --bucket lakefs-logs-<generated>-s3alias --key etl/v1/data/region=<region>/organization=org-<organization>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/<file>-snappy.parquet sample.parquet
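
The same flow can be scripted. Below is a minimal boto3 sketch, assuming the placeholders (role ARN, access point alias, region, and organization) are replaced with the values you received from customer success:

import boto3

# Assume the reader role; the response contains temporary credentials.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::<your-aws-account>:role/<reader-role>",
    RoleSessionName="audit-logs-reader",
)["Credentials"]

# Create an S3 client with the temporary credentials.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# The access point alias can be used anywhere a bucket name is expected.
alias = "lakefs-logs-<generated>-s3alias"
prefix = "etl/v1/data/region=<region>/organization=org-<organization>/"

# List the audit log files under the organization prefix.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=alias, Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"])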

Data layout

Tip

The bucket name is needed when creating the IAM policy, but the Access Point ARN and alias are what you use to access the data (e.g. from the AWS CLI, Spark, etc.).

Bucket Name: lakefs-audit-logs-us-east-1-production

Root prefix: etl/v1/data/region=<region>/organization=org-<organization-name>/

File path pattern: all audit log files are in Parquet format and follow this pattern: etl/v1/data/region=<region>/organization=org-<organization-name>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/*-snappy.parquet

Path Values

region: the lakeFS installation region (i.e. the region in the lakeFS URL: https://<organization-name>.<region>.lakefscloud.io/)

organization: found in the lakeFS URL https://<organization-name>.<region>.lakefscloud.io/. The value in the S3 path must be prefixed with org- (i.e. org-<organization-name>).

Partitions

  • year
  • month
  • day
  • hour

Example

Example paths for the "Acme" organization with two lakeFS installations:

# ACME in us-east-1 
etl/v1/data/region=us-east-1/organization=org-acme/year=2024/month=02/day=12/hour=13/log_abc-snappy.parquet

# ACME in us-west-2 
etl/v1/data/region=us-west-2/organization=org-acme/year=2024/month=02/day=12/hour=13/log_xyz-snappy.parquet

Schema

The files are in Parquet format and can be read directly from Spark or any other client that can read Parquet. Using Spark's printSchema() we can inspect the columns; the latest schema is listed below, with notes on the important ones:

data_user (string): the internal user ID of the user making the request. If an external IdP is used (e.g. SSO, Microsoft Entra), this is the UID provided by the IdP (see below for an example of mapping external IDs in Python).
data_repository (string): the repository ID relevant to this request. Currently only returned for s3_gateway requests.
data_ref (string): the reference ID (tag, branch, ...) relevant to this request. Currently only returned for s3_gateway requests.
data_status_code (int): the HTTP status code returned for this request.
data_service_name (string): the service name for the request; either "rest_api" or "s3_gateway".
data_request_id (string): a unique ID representing this request.
data_path (string): the HTTP path used for this request.
data_operation_id (string): the logical operation ID for this request, e.g. list_objects, delete_repository, ...
data_method (string): the HTTP method of the request.
data_time (string): the start time of this request, in ISO 8601 format.
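
As a quick illustration of the schema, the following PySpark sketch reads the Parquet files through the access point alias and counts requests per user and operation. It assumes a Spark session whose S3 credentials allow reading through the access point alias, and the path placeholders must be replaced with your own values:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakefs-audit-logs").getOrCreate()

# Read all audit logs for the organization via the access point alias.
logs = spark.read.parquet(
    "s3a://lakefs-logs-<generated>-s3alias/etl/v1/data/region=<region>/organization=org-<organization>/"
)

# Most frequent operations per user, most active first.
(logs
    .groupBy("data_user", "data_operation_id")
    .count()
    .orderBy(F.desc("count"))
    .show(20, truncate=False))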

IdP users: map user IDs from audit logs to an email in lakeFS

The data_user column in each log entry contains the ID of the user who performed the action.

  • It might be empty in cases where authentication is not required (e.g. a login attempt).
  • If the user is an API user created internally in lakeFS, that ID is also the name it was given.
  • data_user might contain an ID from an external IdP (e.g. an SSO system). This ID is usually not human friendly, but it can be correlated with a user's email in lakeFS, as in the following example using the Python lakefs-sdk.
import lakefs_sdk

# Configure HTTP basic authorization: basic_auth
configuration = lakefs_sdk.Configuration(
    host = "https://<org>.<region>.lakefscloud.io/api/v1",
    username = 'AKIA...',
    password = '...'
)

# Print all user email and uid in lakeFS 
# the uid is equal to the user id in the audit logs.
with lakefs_sdk.ApiClient(configuration) as api_client:
    auth_api = lakefs_sdk.AuthApi(api_client)
    has_more = True
    next_offset = ''
    page_size = 100 
    while has_more: 
        resp = auth_api.list_users(prefix='', after=next_offset, amount=page_size)
        for u in resp.results:
            email = u.email
            uid = u.id
            print(f'Email: {email}, UID: {uid}')

        has_more = resp.pagination.has_more 
        next_offset = resp.pagination.next_offset

Example: Glue Notebook with Spark

from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# connect to s3 access point 
alias = 's3://<bucket-alias-name>'
s3_dyf = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [alias + "/etl/v1/data/region=<region>/organization=org-<org>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/"],
        "recurse": True,
    },
    transformation_ctx="sample-ctx",
)

s3_dyf.show()
s3_dyf.printSchema()

Audit Logs for lakeFS Enterprise On-Premises

In an on-premises deployment, you are responsible for collecting, storing, and querying audit logs.

Reference Architecture

An audit log pipeline for lakeFS Enterprise consists of three stages:

  1. Collect: Capture logs from lakeFS container stdout and filter for log_audit: true
  2. Store: Ship filtered logs to durable storage (e.g., S3, GCS, Azure Blob)
  3. Query: Use a query engine to analyze logs (e.g., Athena, BigQuery, Elasticsearch)

Collection

The recommended practice is to send logs to container stdout (the default configuration) and use a log collector to capture and forward them. See the logging configuration for more details.

Common log collectors include: Fluent Bit, Fluentd, Logstash.

When configuring your log collector, filter for entries where log_audit equals true to separate audit logs from regular application logs.

lakeFS outputs two kinds of logs:

  • Regular logs: API errors, event descriptions, and debugging information
  • Audit logs: User actions such as creating a branch, committing, or deleting a repository

Note

To filter audit logs, use the boolean field log_audit. When log_audit is true, the log entry is an audit event.

Example audit log entry (JSON):

{
  "client": "lakefs-python-sdk/1.65.2",
  "file": "usr/local/go/src/net/http/server.go:2322",
  "func": "net/http.HandlerFunc.ServeHTTP",
  "host": "lakefs.example.com",
  "level": "info",
  "log_audit": true,
  "method": "POST",
  "msg": "HTTP call ended",
  "operation_id": "DeleteObjects",
  "path": "/api/v1/repositories/my-repo/branches/my-branch/objects/delete",
  "request_id": "1234567-5b66-7655-b4e8-2h0c271f6r90",
  "sent_bytes": 14,
  "service_name": "rest_api",
  "source_ip": "80.0.0.10:34708",
  "status_code": 200,
  "time": "2025-12-25T12:30:32Z",
  "took": 600747890,
  "took_str": "600.74789ms",
  "user": "lakefs-ci-bot"
}
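
The filtering rule itself is simple. Off-the-shelf collectors express it in their own configuration syntax; the Python sketch below shows the equivalent logic, reading JSON log lines from stdin and keeping only entries where log_audit is true:

import json
import sys

# Read lakeFS log lines from stdin and emit only audit events,
# i.e. entries where the boolean field log_audit is true.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        # Skip lines that are not valid JSON.
        continue
    if entry.get("log_audit") is True:
        sys.stdout.write(json.dumps(entry) + "\n")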

Storage and Querying

Once collected, audit logs can be stored and queried in several ways. The choice of storage and query engine depends on your existing infrastructure, retention requirements, and query patterns.

For example: store the logs in S3, index them with a Spark ETL job, and query them with Athena or Spark; or store them in Elasticsearch for real-time analysis.
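
As one illustration of the S3 + Spark + Athena approach, the PySpark sketch below reads collected JSON audit logs and rewrites them as Parquet partitioned by date. The bucket and prefixes are placeholders, not fixed paths:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakefs-audit-etl").getOrCreate()

# Read the raw JSON audit logs shipped by the log collector.
raw = spark.read.json("s3a://<your-log-bucket>/lakefs/audit/raw/")

# Derive time partitions from the ISO 8601 "time" field and write Parquet
# partitioned by year/month/day so Athena or Spark can prune efficiently.
(raw
    .withColumn("ts", F.to_timestamp("time"))
    .withColumn("year", F.year("ts"))
    .withColumn("month", F.month("ts"))
    .withColumn("day", F.dayofmonth("ts"))
    .write
    .mode("append")
    .partitionBy("year", "month", "day")
    .parquet("s3a://<your-log-bucket>/lakefs/audit/parquet/"))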

Scaling Considerations

At high scale, lakeFS can generate a significant volume of audit logs. Consider the following:

  • Partitioning: Partition logs by time (e.g., year/month/day/hour) to improve query performance and manage storage costs
  • Retention policies: Define retention periods and lifecycle rules to archive or delete old logs (see the sketch below)
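
For example, retention can be enforced with an S3 lifecycle rule. The boto3 sketch below assumes a hypothetical bucket, prefix, and a 365-day retention period:

import boto3

s3 = boto3.client("s3")

# Expire audit log objects under the given prefix after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="<your-log-bucket>",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-lakefs-audit-logs",
                "Filter": {"Prefix": "lakefs/audit/"},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
            }
        ]
    },
)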

Schema Considerations

The audit log schema is stable and evolves additively over time. If you're using a schema-aware query engine (e.g., Athena, BigQuery), consider using a schema discovery mechanism such as AWS Glue Crawler or equivalent to automatically detect and update your table schema as new fields appear. The fields described in the Schema section above (remove the data_ prefix for on-prem) are expected to be present in all audit log entries.
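
On AWS, for example, a scheduled Glue Crawler can keep the table definition in sync as new fields and partitions appear. The boto3 sketch below assumes a hypothetical crawler name, IAM role, Glue database, and S3 path:

import boto3

glue = boto3.client("glue")

# Create a crawler that rescans the audit log prefix every hour and updates
# the Glue Data Catalog table as new columns and partitions appear.
glue.create_crawler(
    Name="lakefs-audit-logs-crawler",
    Role="arn:aws:iam::<your-aws-account>:role/<glue-crawler-role>",
    DatabaseName="lakefs_audit",
    Targets={"S3Targets": [{"Path": "s3://<your-log-bucket>/lakefs/audit/parquet/"}]},
    Schedule="cron(0 * * * ? *)",
)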