Auditing

Info

This feature is available on lakeFS Cloud and lakeFS Enterprise

The lakeFS audit log lets you view user actions in a clear, organized table, including what each action was, who performed it, and when.

This can be useful for several purposes, including:

  1. Compliance - Audit logs can be used to show what data users accessed, as well as any changes they made to user management.

  2. Troubleshooting - If something changes on your underlying object store that you weren't expecting, such as a big file suddenly breaking into thousands of smaller files, you can use the audit log to find out what action led to this change.

Setting up access to Audit Logs on AWS S3

Audit logs are accessed through an AWS S3 Access Point.

There are different ways to interact with an access point (see Using access points in AWS).

The initial setup:

  1. Take note of the ARN of the IAM user or role that will be used to access the data, for example the one used by Athena.
  2. Reach out to customer success and provide this ARN. Once the ARN is received, an access point will be created and you should receive the following details in response:
    1. S3 Bucket (e.g. arn:aws:s3:::lakefs-audit-logs-us-east-1-production)
    2. S3 URI to an access point (e.g. s3://arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>)
    3. Access Point alias. You can use this alias instead of the bucket name or Access Point ARN to access data through the Access Point. (e.g. lakefs-logs-<generated>-s3alias)
  3. Update your IAM Role policy and trust policy if required.

A minimal example IAM policy for 2 lakeFS installations in 2 regions (us-east-1, us-west-2):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/*",
                "arn:aws:s3:::lakefs-logs-<generated>-s3alias/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/*"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                        "etl/v1/data/region=<region_b>/organization=org-<organization>/*"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/etl/v1/data/region=<region_b>/organization=org-<organization>/*",
                "arn:aws:s3:::lakefs-logs-<generated>-s3alias/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/object/etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/object/etl/v1/data/region=<region_b>/organization=org-<organization>/*"
            ]
        },
        {
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:kms:us-east-1:<treeverse-id>:key/<encryption-key-id>"
            ],
            "Effect": "Allow"
        }
    ]
}
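When more regions are added, the policy above can be generated instead of edited by hand. The sketch below rebuilds the same three statements in Python using only the standard library. The function name `build_reader_policy` and its parameters are illustrative and not part of any lakeFS or AWS tooling; the placeholder values are the ones used above and must be replaced with the details you receive.

```python
import json

# Bucket ARN as shown in the example policy above.
BUCKET_ARN = "arn:aws:s3:::lakefs-audit-logs-us-east-1-production"


def build_reader_policy(organization, regions, access_point_arn, alias, kms_key_arn):
    """Build the read-only policy document for the given regions (illustrative helper)."""
    # One key prefix per lakeFS installation region.
    prefixes = [
        f"etl/v1/data/region={region}/organization=org-{organization}/*"
        for region in regions
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing, restricted to the organization's prefixes.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [
                    BUCKET_ARN,
                    f"{BUCKET_ARN}/*",
                    f"arn:aws:s3:::{alias}/*",
                    access_point_arn,
                    f"{access_point_arn}/*",
                ],
                "Condition": {"StringLike": {"s3:prefix": prefixes}},
            },
            {
                # Reading objects via the bucket, the alias, or the access point.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": (
                    [BUCKET_ARN]
                    + [f"{BUCKET_ARN}/{p}" for p in prefixes]
                    + [f"arn:aws:s3:::{alias}/*"]
                    + [f"{access_point_arn}/object/{p}" for p in prefixes]
                ),
            },
            {
                # Decrypting objects with the key shared for the audit logs.
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": [kms_key_arn],
            },
        ],
    }


policy = build_reader_policy(
    organization="<organization>",
    regions=["<region_a>", "<region_b>"],
    access_point_arn="arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>",
    alias="lakefs-logs-<generated>-s3alias",
    kms_key_arn="arn:aws:kms:us-east-1:<treeverse-id>:key/<encryption-key-id>",
)
print(json.dumps(policy, indent=4))
```

Adding a third installation then only requires appending its region to the `regions` list.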

Trust Policy example that allows anyone in your account to assume the role above:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<YOUR_ACCOUNT_ID>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}

Authentication is done by assuming an IAM Role:

# Assume the role, then export the returned credentials as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN:
aws sts assume-role --role-arn arn:aws:iam::<your-aws-account>:role/<reader-role> --role-session-name <name> 

# verify role assumed
aws sts get-caller-identity 

# list objects (can be used with --recursive) with access point ARN
aws s3 ls arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/etl/v1/data/region=<region>/organization=org-<organization>/

# get object locally via s3 access point alias 
aws s3api get-object --bucket lakefs-logs-<generated>-s3alias --key etl/v1/data/region=<region>/organization=org-<organization>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/<file>-snappy.parquet sample.parquet 

Data layout

Tip

The bucket name is important when creating the IAM policy, but the Access Point ARN and alias are what you use to access the data (e.g. from the AWS CLI, Spark, etc.).

Bucket Name: lakefs-audit-logs-us-east-1-production

Root prefix: etl/v1/data/region=<region>/organization=org-<organization-name>/

File path pattern: All audit log files are in Parquet format and follow the pattern: etl/v1/data/region=<region>/organization=org-<organization-name>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/*-snappy.parquet

Path Values

region: The lakeFS installation region (e.g. the region in the lakeFS URL: https://<organization-name>.<region>.lakefscloud.io/)

organization: Found in the lakeFS URL https://<organization-name>.<region>.lakefscloud.io/. In the S3 path the value is prefixed with org-, i.e. org-<organization-name>
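Both path values can be read off the lakeFS URL. The following is a minimal Python sketch of that mapping, assuming the URL has exactly the `https://<organization-name>.<region>.lakefscloud.io/` shape described above; the function name `path_values_from_url` is made up for illustration.

```python
from urllib.parse import urlparse


def path_values_from_url(lakefs_url: str) -> tuple[str, str]:
    """Return (region, organization) as they appear in the S3 path."""
    host = urlparse(lakefs_url).hostname or ""
    parts = host.split(".")
    # Expect <organization-name>.<region>.lakefscloud.io
    if len(parts) != 4 or parts[2:] != ["lakefscloud", "io"]:
        raise ValueError(f"unexpected lakeFS Cloud host: {host!r}")
    organization_name, region = parts[0], parts[1]
    # The organization value in the S3 path carries the org- prefix.
    return region, f"org-{organization_name}"


region, organization = path_values_from_url("https://acme.us-east-1.lakefscloud.io/")
print(f"region={region}/organization={organization}")
```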

Partitions

  • year
  • month
  • day
  • hour

Example

Example paths for the "Acme" organization with 2 lakeFS installations:

# ACME in us-east-1 
etl/v1/data/region=us-east-1/organization=org-acme/year=2024/month=02/day=12/hour=13/log_abc-snappy.parquet

# ACME in us-west-2 
etl/v1/data/region=us-west-2/organization=org-acme/year=2024/month=02/day=12/hour=13/log_xyz-snappy.parquet
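To read a single hour of logs, the partition prefix can be built from a timestamp. The sketch below follows the example paths above (four-digit year, zero-padded month, day and hour). The helper name `hourly_prefix` is hypothetical, and the time zone of the partitions is not stated here, so the function simply formats whatever datetime it is given.

```python
from datetime import datetime


def hourly_prefix(region: str, organization_name: str, hour: datetime) -> str:
    """Build the key prefix that holds one hour of audit log files."""
    return (
        f"etl/v1/data/region={region}/organization=org-{organization_name}/"
        f"year={hour:%Y}/month={hour:%m}/day={hour:%d}/hour={hour:%H}/"
    )


# Matches the prefix of the first example path above.
print(hourly_prefix("us-east-1", "acme", datetime(2024, 2, 12, 13)))
```

The resulting prefix can be appended to the Access Point alias or ARN when listing or reading objects.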

Schema

The files are in Parquet format and can be read directly from Spark or any other client that can read Parquet files. The schema can be inspected with Spark's printSchema(). The latest schema, with comments on the important columns:

| Column | Type | Description |
|---|---|---|
| data_user | string | The internal user ID of the user making the request. If an external IdP is used (e.g. SSO, Microsoft Entra, etc.), this is the UID assigned by the IdP (see the example below for mapping external IDs to emails in Python) |
| data_repository | string | The repository ID relevant for this request. Currently only returned for s3_gateway requests |
| data_ref | string | The reference ID (tag, branch, ...) relevant for this request. Currently only returned for s3_gateway requests |
| data_status_code | int | HTTP status code returned for this request |
| data_service_name | string | Service name for the request. Either "rest_api" or "s3_gateway" |
| data_request_id | string | Unique ID representing this request |
| data_path | string | HTTP path used for this request |
| data_operation_id | string | Logical operation ID for this request, e.g. list_objects, delete_repository, ... |
| data_method | string | HTTP method for the request |
| data_time | string | Datetime representing the start time of this request, in ISO 8601 format |
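As an illustration of how these columns combine, the Python sketch below counts failed requests per user since a given time. It assumes the rows have already been loaded into dictionaries keyed by the column names above (for example from a Parquet reader); the sample rows and the helper names are invented for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_time(value: str) -> datetime:
    # fromisoformat() before Python 3.11 rejects a trailing "Z", so normalise it first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def failed_requests_by_user(rows, since: datetime) -> Counter:
    """Count requests with an HTTP status >= 400 per data_user since `since`."""
    counts = Counter()
    for row in rows:
        if row["data_status_code"] >= 400 and parse_time(row["data_time"]) >= since:
            # data_user can be empty when no authentication was required.
            counts[row["data_user"] or "<unauthenticated>"] += 1
    return counts


# Hypothetical sample rows, not real audit data.
sample_rows = [
    {"data_user": "alice", "data_status_code": 403, "data_time": "2024-02-12T13:05:00Z"},
    {"data_user": "alice", "data_status_code": 200, "data_time": "2024-02-12T13:06:00Z"},
    {"data_user": "", "data_status_code": 401, "data_time": "2024-02-12T13:07:00Z"},
    {"data_user": "bob", "data_status_code": 404, "data_time": "2024-02-11T09:00:00Z"},
]
print(failed_requests_by_user(sample_rows, datetime(2024, 2, 12, tzinfo=timezone.utc)))
```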

IdP users: map user IDs from audit logs to an email in lakeFS

The data_user column in each log entry holds the ID of the user who performed the action.

  • It might be empty when authentication is not required (e.g. a login attempt).
  • If the user is an API user created internally in lakeFS, the ID is also the name the user was given.
  • data_user might contain an ID from an external IdP (e.g. an SSO system). Such IDs are usually not human-friendly, but they can be correlated to the email used in lakeFS. The following example uses the Python lakefs-sdk:

import lakefs_sdk

# Configure HTTP basic authorization: basic_auth
configuration = lakefs_sdk.Configuration(
    host = "https://<org>.<region>.lakefscloud.io/api/v1",
    username = 'AKIA...',
    password = '...'
)

# Print the email and UID of every user in lakeFS.
# The UID is equal to the user ID in the audit logs.
with lakefs_sdk.ApiClient(configuration) as api_client:
    auth_api = lakefs_sdk.AuthApi(api_client)
    has_more = True
    next_offset = ''
    page_size = 100 
    while has_more: 
        resp = auth_api.list_users(prefix='', after=next_offset, amount=page_size)
        for u in resp.results:
            email = u.email
            uid = u.id
            print(f'Email: {email}, UID: {uid}')

        has_more = resp.pagination.has_more 
        next_offset = resp.pagination.next_offset
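Once the users are listed, the UID-to-email mapping can be applied to audit log rows. The sketch below is one way to do that in plain Python: `index_users` and `label_rows` are hypothetical helpers, the user objects only need `id` and `email` attributes (as used in the listing above), and the sample users and rows are invented.

```python
from types import SimpleNamespace


def index_users(users):
    """Map each user's id to its email, skipping users without an email."""
    return {u.id: u.email for u in users if getattr(u, "email", None)}


def label_rows(rows, uid_to_email):
    """Yield each audit row with an extra user_email key (None when unknown)."""
    for row in rows:
        uid = row.get("data_user") or ""
        yield {**row, "user_email": uid_to_email.get(uid)}


# Stand-ins for the objects returned by the user listing; values are made up.
users = [
    SimpleNamespace(id="00u1abcd", email="alice@example.com"),
    SimpleNamespace(id="ci-bot", email=None),
]
rows = [
    {"data_user": "00u1abcd", "data_operation_id": "list_objects"},
    {"data_user": "ci-bot", "data_operation_id": "delete_repository"},
]
for labelled in label_rows(rows, index_users(users)):
    print(labelled["data_user"], labelled["user_email"])
```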

Example: Glue Notebook with Spark

from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# connect to s3 access point 
alias = 's3://<bucket-alias-name>'
s3_dyf = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [alias + "/etl/v1/data/region=<region>/organization=org-<org>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/"],
        "recurse": True,
    },
    transformation_ctx="sample-ctx",
)

s3_dyf.show()
s3_dyf.printSchema()

Audit Log Iceberg System Table (On-Premises)

lakeFS Enterprise can store audit events in a built-in Iceberg table, queryable through the lakeFS Iceberg REST catalog. No external pipeline is required to ingest or maintain audit data.

Supported storage backends

This feature requires Amazon S3 or Google Cloud Storage (GCS) as the blockstore backend. Azure Blob Storage is not currently supported — Azure deployments should use the log-based audit approach below.

When enabled, lakeFS automatically:

  1. Creates a system repository (lakefssystem) and an Iceberg table (system.audit_log)
  2. Captures every audit event (API requests and S3 Gateway operations)
  3. Flushes events to the Iceberg audit table
  4. Makes events immediately queryable through any Iceberg-compatible query engine (Spark, Trino, Athena, etc.)

Enabling Audit Logs

lakeFS Cloud

Audit log configuration is managed by Treeverse for lakeFS Cloud deployments. To enable the Iceberg audit log, contact support@treeverse.io.

For self-managed lakeFS Enterprise deployments, add the following to your lakeFS configuration:

audit_log:
  enabled: true
  retention_days: 90        # 0 = infinite retention
  storage_namespace: s3://my-bucket/lakefssystem  # where audit data is stored
  flush:
    interval: 1m            # how often to flush buffered events
    batch_size: 100000      # flush when this many events accumulate (whichever comes first)

Note

storage_namespace must point to a location in the same object store configured in your blockstore section. If omitted, lakeFS derives it from blockstore.default_namespace_prefix by appending /lakefssystem (e.g. s3://my-bucket/lakefssystem).

Access Control

Audit log access is governed by dedicated audit permissions, separate from generic catalog:* permissions:

| Action | Description | Resource ARN |
|---|---|---|
| audit:ReadAuditLog | Read/query the audit log table | arn:lakefs:audit:::log |
| audit:WriteAuditLog | Write to the audit log (system use only) | arn:lakefs:audit:::log |

  • Read access: Controlled by the AuditLogRead policy (action audit:ReadAuditLog). On new installations this policy is automatically attached to the Admins and SuperUsers groups. To grant access to other users or groups, attach the AuditLogRead policy shown below.
  • Writes are system-only. The lakefssystem repository is read-only. The audit service user (created during bootstrap) holds audit:* for ingestion, compaction, and expiry.
  • Self-audit exclusion: Events targeting the lakefssystem repository itself are excluded to prevent recursive self-auditing.

Setting up access on existing installations

If the AuditLogRead policy is not present (e.g. on installations created before this feature was available), create it manually using the example below and attach it to the relevant groups.

Example: The AuditLogRead policy

{
  "id": "AuditLogRead",
  "statement": [
    {
      "action": ["audit:ReadAuditLog"],
      "resource": "arn:lakefs:audit:::log",
      "effect": "allow"
    }
  ]
}

For more details on the RBAC model, see Role-Based Access Control.

Querying Audit Logs

Connect any Iceberg-compatible query engine to the lakeFS Iceberg REST catalog and query the audit log table.

The table is located at lakefssystem.main.system.audit_log where:

  • lakefssystem — the auto-created system repository
  • main — the default branch
  • system — the Iceberg namespace
  • audit_log — the table name

The following are examples of queries you can run on the audit log table:

-- Recent activity by a specific user
SELECT time, user, repository, ref, operation_id, path, status_code
FROM lakefssystem.main.system.audit_log
WHERE user = 'alice'
ORDER BY time DESC
LIMIT 50;

-- Top API operations in the last 7 days
SELECT operation_id, COUNT(*) AS calls
FROM lakefssystem.main.system.audit_log
WHERE time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY operation_id
ORDER BY calls DESC
LIMIT 20;

-- Repository activity overview (last 24 hours)
SELECT repository, COUNT(*) AS operations
FROM lakefssystem.main.system.audit_log
WHERE time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY repository
ORDER BY operations DESC;

Schema

The Iceberg audit_log table uses the following schema:

| Column | Type | Nullable | Description |
|---|---|---|---|
| user | string | yes | Username of the authenticated user (empty for unauthenticated) |
| repository | string | yes | Repository name (empty if not applicable) |
| ref | string | yes | Branch, tag, or commit reference |
| status_code | int32 | no | HTTP response status code |
| service_name | string | no | rest_api or s3_gateway |
| request_id | string | no | Unique request identifier |
| path | string | yes | HTTP request path |
| operation_id | string | no | Logical operation (e.g., ListRepositories, GetObject) |
| method | string | no | HTTP method (GET, POST, etc.) |
| source_ip | string | yes | Client IP address and port |
| client | string | yes | Client identifier (SDK name/version or User-Agent) |
| time | timestamp (UTC) | no | Request timestamp |

The table is partitioned by days(time) and repository for efficient querying.

Maintenance

A periodic maintenance job compacts small files, expires old snapshots, cleans up orphaned files, and commits changes. It can run in two modes:

  • In-process (default): a scheduler embedded in the lakeFS server runs maintenance automatically. This is the simplest deployment — enabled by default when audit_log.enabled is true.
  • External job: run lakefs audit maintain as a scheduled job in your own orchestration system (e.g. Kubernetes CronJob, cron, Airflow). To use this mode, set audit_log.maintenance.enabled: false in the server configuration to disable the in-process scheduler.

Note

Both modes use identical configuration and emit the same metrics, making it straightforward to switch between them.

In-process mode

The in-process scheduler runs inside the lakeFS server on a configurable cron schedule. A distributed lock ensures only one instance runs maintenance at a time across a multi-replica deployment.

audit_log:
  enabled: true
  maintenance:
    enabled: true              # default: true; only takes effect when `audit_log.enabled` is true
    schedule: "0 * * * *"      # cron expression (default: every hour)

In-process maintenance metrics are scraped via the lakeFS server's /metrics endpoint alongside all other lakeFS metrics.

External job

Run lakefs audit maintain as a scheduled job using the same lakeFS Enterprise binary and configuration file:

lakefs audit maintain -c /etc/lakefs/config.yaml --retention-days 90

The job performs compaction, snapshot expiration, orphan cleanup, and commits the changes. Each step can be toggled with flags (--compact, --expire-snapshots, --cleanup-orphans, --commit — all default to true).

Exit codes:

| Code | Meaning |
|---|---|
| 0 | All steps completed successfully |
| 1 | One or more steps failed |
| 2 | Compaction failed |

Use the exit code and the metrics below to monitor maintenance health.
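A scheduler wrapper can translate the exit code into a readable status. The following Python sketch runs the command shown above and maps the three documented codes; the wrapper itself, its function names, and the handling of any other code are assumptions for illustration.

```python
import subprocess

# Exit codes as documented in the table above.
EXIT_MEANINGS = {
    0: "All steps completed successfully",
    1: "One or more steps failed",
    2: "Compaction failed",
}


def describe_exit(code: int) -> str:
    """Return the documented meaning, or flag a code that is not documented."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}")


def run_maintenance(config_path: str = "/etc/lakefs/config.yaml", retention_days: int = 90):
    """Run one maintenance pass and return (exit_code, meaning)."""
    cmd = [
        "lakefs", "audit", "maintain",
        "-c", config_path,
        "--retention-days", str(retention_days),
    ]
    # check=False: a non-zero exit is reported to the caller rather than raised.
    result = subprocess.run(cmd, check=False)
    return result.returncode, describe_exit(result.returncode)
```

Calling `run_maintenance()` from a cron-driven script and alerting on any non-zero code is one simple way to use it.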

Tuning flags:

| Flag | Default | Description |
|---|---|---|
| --retention-days | 90 | Expire snapshots older than this (0 = no expiration) |
| --compact-min-files | 3 | Minimum small files to trigger compaction |
| --compact-max-small-file-size | 33554432 | Files smaller than this (bytes) are "small" (default 32 MiB) |
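The two compaction flags interact as follows, under one reading of the descriptions above: a file counts as small when it is strictly below the size threshold, and compaction is triggered once at least the minimum number of small files exists. This Python sketch only illustrates that reading; the actual lakeFS logic (for example how files are grouped) may differ, and the function names are hypothetical.

```python
DEFAULT_MIN_FILES = 3
DEFAULT_MAX_SMALL_FILE_SIZE = 33554432  # 32 * 1024 * 1024 bytes


def small_files(file_sizes, max_small_file_size=DEFAULT_MAX_SMALL_FILE_SIZE):
    """Return the sizes that fall strictly below the small-file threshold."""
    return [size for size in file_sizes if size < max_small_file_size]


def should_compact(file_sizes, min_files=DEFAULT_MIN_FILES,
                   max_small_file_size=DEFAULT_MAX_SMALL_FILE_SIZE):
    """True when enough small files exist to make compaction worthwhile."""
    return len(small_files(file_sizes, max_small_file_size)) >= min_files


# Three files of a few KiB each reach the default minimum of 3 small files.
print(should_compact([1024, 2048, 4096]))
```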

When running as an external job, metrics can be pushed to a Prometheus Pushgateway after each run. Configure the push target via audit_log.maintenance.metrics_push_url in the lakeFS configuration file. If the URL is empty, no push happens.

Helm chart: The lakeFS Helm chart can deploy a Kubernetes CronJob when auditLog.maintenance.cronJob is set to true. It shares the same config file and license secret as the main deployment.

Metrics

| Metric | Type | Description |
|---|---|---|
| audit_maintain_step_duration_seconds{step} | histogram | Duration of each maintenance step |
| audit_maintain_step_errors_total{step} | counter | Number of failed maintenance steps |
| audit_maintain_success_total | counter | Number of fully successful maintenance runs |
| audit_compaction_chunks_total | counter | Compaction chunks processed |
| audit_compaction_files_merged_total | counter | Source files merged during compaction |
| audit_compaction_bytes_merged_total | counter | Bytes merged during compaction |

The {step} label takes the values: compaction, snapshot-expiration, orphan-cleanup, lakefs-commit, all-steps.

For configuration details, see the audit_log section in the Enterprise Configuration Reference.

Log-Based Audit (On-Premises)

For deployments using Azure Blob Storage or other blockstore types not supported by the Iceberg audit pipeline, audit events can be collected from lakeFS container stdout using a log collector.

Collection

Send logs to container stdout (the default) and use a log collector to capture and forward them. Filter for entries where log_audit equals true.

Common log collectors: Fluent Bit, Fluentd, Logstash.

Example audit log entry (JSON):

{
  "client": "lakefs-python-sdk/1.65.2",
  "host": "lakefs.example.com",
  "level": "info",
  "log_audit": true,
  "method": "POST",
  "msg": "HTTP call ended",
  "operation_id": "DeleteObjects",
  "path": "/api/v1/repositories/my-repo/branches/my-branch/objects/delete",
  "request_id": "1234567-5b66-7655-b4e8-2h0c271f6r90",
  "service_name": "rest_api",
  "source_ip": "80.0.0.10:34708",
  "status_code": 200,
  "time": "2025-12-25T12:30:32Z",
  "user": "lakefs-ci-bot"
}
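If no log collector filter is available, the same selection can be done in a few lines of Python. The sketch below assumes lakeFS is configured to emit JSON logs with one object per line, as in the entry above; the function name `audit_entries` and the sample lines are illustrative.

```python
import json


def audit_entries(lines):
    """Yield parsed log entries whose log_audit field is true."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(entry, dict) and entry.get("log_audit") is True:
            yield entry


# Hypothetical stdout sample: one audit entry, one regular entry, one plain-text line.
sample_lines = [
    '{"log_audit": true, "operation_id": "DeleteObjects", "user": "lakefs-ci-bot", "status_code": 200}',
    '{"level": "info", "msg": "starting server"}',
    "plain text line",
]
for entry in audit_entries(sample_lines):
    print(entry["user"], entry["operation_id"], entry["status_code"])
```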

Storage and Querying

Once collected, audit logs can be stored and queried using various approaches depending on your infrastructure. The choice of storage and query engine depends on your existing infrastructure, retention requirements, and query patterns.

For example, store the logs in S3 or Azure Blob Storage, index them with a Spark ETL job, and query them with Athena or Spark; or store them in Elasticsearch for real-time analysis.

Scaling Considerations

At high scale, lakeFS can generate a significant volume of audit logs. Consider the following:

  • Partitioning: Partition logs by time (e.g., year/month/day/hour) to improve query performance and manage storage costs
  • Retention policies: Define retention periods and lifecycle rules to archive or delete old logs
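The time-based partitioning suggested above can be derived from each entry's time field. This is a minimal Python sketch that assumes the ISO 8601 timestamps shown in the example entry and a year/month/day/hour layout; the helper names are made up, and converting to UTC before partitioning is a design choice of the sketch.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_path(entry) -> str:
    """Return the year/month/day/hour partition for one audit entry."""
    # Normalise a trailing "Z" for Python versions before 3.11.
    ts = datetime.fromisoformat(entry["time"].replace("Z", "+00:00"))
    ts = ts.astimezone(timezone.utc)
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}"


def group_by_partition(entries):
    """Group entries by their partition path, ready to be written out per partition."""
    groups = defaultdict(list)
    for entry in entries:
        groups[partition_path(entry)].append(entry)
    return dict(groups)


print(partition_path({"time": "2025-12-25T12:30:32Z"}))
```

Lifecycle rules for retention can then be expressed per prefix, since every partition path encodes its own date.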

Schema Considerations

The audit log schema is stable and evolves additively over time. If you're using a schema-aware query engine (e.g., Athena, BigQuery), consider using a schema discovery mechanism such as AWS Glue Crawler or equivalent to automatically detect and update your table schema as new fields appear. The fields described in the Schema section above (remove the data_ prefix for on-prem) are expected to be present in all audit log entries.
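When one query layer has to read both lakeFS Cloud files and on-prem log entries, the field names can be normalized by dropping the data_ prefix. The Python sketch below shows that renaming only; the function name `normalize_record` and the sample record are illustrative.

```python
PREFIX = "data_"


def normalize_record(record: dict) -> dict:
    """Return a copy of the record with the data_ prefix removed from field names."""
    return {
        (key[len(PREFIX):] if key.startswith(PREFIX) else key): value
        for key, value in record.items()
    }


# Hypothetical lakeFS Cloud row; after normalization the keys match the on-prem fields.
cloud_row = {"data_user": "alice", "data_status_code": 200, "data_method": "GET"}
print(normalize_record(cloud_row))
```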