Auditing¶
Info
This feature is available on lakeFS Cloud and lakeFS Enterprise
The lakeFS audit log allows you to view all relevant user action information in a clear and organized table, including when the action was performed, by whom, and what they did.
This can be useful for several purposes, including:
- Compliance - Audit logs can be used to show what data users accessed, as well as any changes they made to user management.
- Troubleshooting - If something changes on your underlying object store that you weren't expecting, such as a big file suddenly breaking into thousands of smaller files, you can use the audit log to find out what action led to this change.
Setting up access to Audit Logs on AWS S3¶
Access to the audit logs is provided through an AWS S3 Access Point.
There are different ways to interact with an access point (see Using access points in AWS).
The initial setup:
- Take note of the IAM Role ARN that will be used to access the data. This should be the user or role used by e.g. Athena.
- Reach out to customer success and provide this ARN. Once the role ARN is received, an access point will be created and you will get back the following details:
    - S3 Bucket (e.g. arn:aws:s3:::lakefs-audit-logs-us-east-1-production)
    - S3 URI to an access point (e.g. s3://arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>)
    - Access Point alias, which you can use instead of the bucket name or Access Point ARN to access data through the Access Point (e.g. lakefs-logs-<generated>-s3alias)
- Update your IAM Role policy and trust policy if required
A minimal example of an IAM policy for 2 lakeFS installations in 2 regions (us-east-1, us-west-2):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/*",
                "arn:aws:s3:::lakefs-logs-<generated>-s3alias/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/*"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                        "etl/v1/data/region=<region_b>/organization=org-<organization>/*"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/etl/v1/data/region=<region_b>/organization=org-<organization>/*",
                "arn:aws:s3:::lakefs-logs-<generated>-s3alias/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/object/etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/object/etl/v1/data/region=<region_b>/organization=org-<organization>/*"
            ]
        },
        {
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:kms:us-east-1:<treeverse-id>:key/<encryption-key-id>"
            ],
            "Effect": "Allow"
        }
    ]
}
Trust Policy example that allows anyone in your account to assume the role above:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<YOUR_ACCOUNT_ID>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}
Authentication is done by assuming an IAM Role:
# Assume the role; export the returned AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN for the commands below:
aws sts assume-role --role-arn arn:aws:iam::<your-aws-account>:role/<reader-role> --role-session-name <name>

# Verify the role was assumed
aws sts get-caller-identity

# List objects (can be used with --recursive) using the access point ARN
aws s3 ls arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/etl/v1/data/region=<region>/organization=org-<organization>/

# Download an object locally using the S3 access point alias
aws s3api get-object --bucket lakefs-logs-<generated>-s3alias --key etl/v1/data/region=<region>/organization=org-<organization>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/<file>-snappy.parquet sample.parquet
Data layout¶
Tip
The bucket name is needed when creating the IAM policy, but the Access Point ARN and alias are what you use to actually access the data (e.g. from the AWS CLI, Spark, etc.).
Bucket Name: lakefs-audit-logs-us-east-1-production
Root prefix: etl/v1/data/region=<region>/organization=org-<organization-name>/
File path pattern: all audit log files are in Parquet format and follow this pattern: etl/v1/data/region=<region>/organization=org-<organization-name>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/*-snappy.parquet
Path Values¶
- region: the lakeFS installation region (i.e. the region in the lakeFS URL https://<organization-name>.<region>.lakefscloud.io/)
- organization: found in the lakeFS URL https://<organization-name>.<region>.lakefscloud.io/. The value in the S3 path must be prefixed with org-, i.e. org-<organization-name>
Partitions¶
- year
- month
- day
- hour
Example¶
As an example, these are the paths for the "Acme" organization with 2 lakeFS installations:
# ACME in us-east-1
etl/v1/data/region=us-east-1/organization=org-acme/year=2024/month=02/day=12/hour=13/log_abc-snappy.parquet
# ACME in us-west-2
etl/v1/data/region=us-west-2/organization=org-acme/year=2024/month=02/day=12/hour=13/log_xyz-snappy.parquet
Schema¶
The files are in parquet format and can be accessed directly from Spark or any client that can read parquet files.
Using Spark's printSchema() we can inspect the values. This is the latest schema, with comments on the important columns:
| column | type | description |
|---|---|---|
| data_user | string | The internal user ID of the user making the request. If using an external IdP (e.g. SSO, Microsoft Entra, etc.) it will be the UID represented by the IdP (see below for an example of how to map external IDs in Python) |
| data_repository | string | The repository ID relevant for this request. Currently only returned for s3_gateway requests |
| data_ref | string | The reference ID (tag, branch, ...) relevant for this request. Currently only returned for s3_gateway requests |
| data_status_code | int | HTTP status code returned for this request |
| data_service_name | string | Service name for the request. Either "rest_api" or "s3_gateway" |
| data_request_id | string | Unique ID representing this request |
| data_path | string | HTTP path used for this request |
| data_operation_id | string | Logical operation ID for this request, e.g. list_objects, delete_repository, ... |
| data_method | string | HTTP method for the request |
| data_time | string | Datetime representing the start time of this request, in ISO 8601 format |
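For example, here is a minimal PySpark sketch for reading these files directly through the Access Point alias issued during setup. It assumes the Hadoop S3A connector is available to Spark and that credentials for the assumed reader role are visible to it (environment variables, instance profile, etc.); adjust the placeholders to your own values.

from pyspark.sql import SparkSession

# A minimal sketch, not a complete setup: read the audit log parquet files
# through the S3 Access Point alias. Assumes the S3A connector is on the
# classpath and AWS credentials for the assumed reader role are available.
spark = SparkSession.builder.getOrCreate()

audit_df = spark.read.parquet(
    "s3a://lakefs-logs-<generated>-s3alias/etl/v1/data/region=<region>/organization=org-<organization>/"
)
audit_df.printSchema()
audit_df.show(10)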
IdP users: map user IDs from audit logs to an email in lakeFS¶
The data_user column in each log entry holds the ID of the user who performed the action.
- It might be empty in cases where authentication is not required (e.g. a login attempt).
- If the user is an API user created internally in lakeFS, that ID is also the name it was given.
- data_user might contain an ID from an external IdP (i.e. an SSO system). This ID is usually not human friendly, but it can be correlated with the email used in lakeFS, as in the following example using the Python lakefs-sdk.
import lakefs_sdk

# Configure HTTP basic authorization: basic_auth
configuration = lakefs_sdk.Configuration(
    host = "https://<org>.<region>.lakefscloud.io/api/v1",
    username = 'AKIA...',
    password = '...'
)

# Print all user emails and uids in lakeFS.
# The uid is equal to the user id in the audit logs.
with lakefs_sdk.ApiClient(configuration) as api_client:
    auth_api = lakefs_sdk.AuthApi(api_client)
    has_more = True
    next_offset = ''
    page_size = 100
    while has_more:
        resp = auth_api.list_users(prefix='', after=next_offset, amount=page_size)
        for u in resp.results:
            email = u.email
            uid = u.id
            print(f'Email: {email}, UID: {uid}')
        has_more = resp.pagination.has_more
        next_offset = resp.pagination.next_offset
Example: Glue Notebook with Spark
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# connect to s3 access point
alias = 's3://<bucket-alias-name>'

s3_dyf = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [alias + "/etl/v1/data/region=<region>/organization=org-<org>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/"],
        "recurse": True,
    },
    transformation_ctx="sample-ctx",
)

s3_dyf.show()
s3_dyf.printSchema()
Audit Log Iceberg System Table (On-Premises)¶
lakeFS Enterprise can store audit events in a built-in Iceberg table, queryable through the lakeFS Iceberg REST catalog. No external pipeline is required to ingest or maintain audit data.
Supported storage backends
This feature requires Amazon S3 or Google Cloud Storage (GCS) as the blockstore backend. Azure Blob Storage is not currently supported — Azure deployments should use the log-based audit approach below.
When enabled, lakeFS automatically:
- Creates a system repository (lakefssystem) and an Iceberg table (system.audit_log)
- Captures every audit event (API requests and S3 Gateway operations)
- Flushes events to the Iceberg audit table
- Makes events immediately queryable through any Iceberg-compatible query engine (Spark, Trino, Athena, etc.)
Enabling Audit Logs¶
lakeFS Cloud
Audit log configuration is managed by Treeverse for lakeFS Cloud deployments. To enable the Iceberg audit log, contact support@treeverse.io.
Add the following to your lakeFS configuration:
audit_log:
  enabled: true
  retention_days: 90                              # 0 = infinite retention
  storage_namespace: s3://my-bucket/lakefssystem  # where audit data is stored
  flush:
    interval: 1m        # how often to flush buffered events
    batch_size: 100000  # flush when this many events accumulate (whichever comes first)
Note
storage_namespace must point to a location in the same object store configured in your blockstore section. If omitted, lakeFS derives it from blockstore.default_namespace_prefix by appending /lakefssystem (e.g. s3://my-bucket/lakefssystem).
Access Control¶
Audit log access is governed by dedicated audit permissions, separate from generic catalog:* permissions:
| Action | Description | Resource ARN |
|---|---|---|
| audit:ReadAuditLog | Read/query the audit log table | arn:lakefs:audit:::log |
| audit:WriteAuditLog | Write to the audit log (system use only) | arn:lakefs:audit:::log |
- Read access: Controlled by the AuditLogRead policy (action audit:ReadAuditLog). On new installations this policy is automatically attached to the Admins and SuperUsers groups. To grant access to other users or groups, attach the AuditLogRead policy shown below.
- Writes are system-only. The lakefssystem repository is read-only. The audit service user (created during bootstrap) holds audit:* for ingestion, compaction, and expiry.
- Self-audit exclusion: Events targeting the lakefssystem repository itself are excluded to prevent recursive self-auditing.
Setting up access on existing installations
If the AuditLogRead policy is not present (e.g. on installations created before this feature was available), create it manually using the example below and attach it to the relevant groups.
Example: The AuditLogRead policy
{
    "id": "AuditLogRead",
    "statement": [
        {
            "action": ["audit:ReadAuditLog"],
            "resource": "arn:lakefs:audit:::log",
            "effect": "allow"
        }
    ]
}
For more details on the RBAC model, see Role-Based Access Control.
Querying Audit Logs¶
Connect any Iceberg-compatible query engine to the lakeFS Iceberg REST catalog and query the audit log table.
The table is located at lakefssystem.main.system.audit_log where:
- lakefssystem — the auto-created system repository
- main — the default branch
- system — the Iceberg namespace
- audit_log — the table name
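For example, a minimal PySpark sketch of such a connection is shown below. The catalog endpoint path, the credential format, and the Iceberg runtime version are assumptions; adjust them to your Spark version and lakeFS deployment, and consult the lakeFS Iceberg REST catalog documentation for the exact settings.

from pyspark.sql import SparkSession

# A minimal sketch, not a definitive setup: register the lakeFS Iceberg REST
# catalog as Spark's default catalog and query the audit log table.
# The endpoint URL, credential format and Iceberg runtime version below are
# assumptions; adjust them to your deployment.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakefs.type", "rest")
    .config("spark.sql.catalog.lakefs.uri", "https://lakefs.example.com/iceberg/api")
    .config("spark.sql.catalog.lakefs.credential", "AKIA...:<secret-access-key>")
    .config("spark.sql.defaultCatalog", "lakefs")
    .getOrCreate()
)

# Same table path as the SQL examples below
spark.sql("""
    SELECT operation_id, COUNT(*) AS calls
    FROM lakefssystem.main.system.audit_log
    GROUP BY operation_id
    ORDER BY calls DESC
    LIMIT 20
""").show()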
The following are examples of queries you can run on the audit log table:
-- Recent activity by a specific user
SELECT time, user, repository, ref, operation_id, path, status_code
FROM lakefssystem.main.system.audit_log
WHERE user = 'alice'
ORDER BY time DESC
LIMIT 50;
-- Top API operations in the last 7 days
SELECT operation_id, COUNT(*) AS calls
FROM lakefssystem.main.system.audit_log
WHERE time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY operation_id
ORDER BY calls DESC
LIMIT 20;
-- Repository activity overview (last 24 hours)
SELECT repository, COUNT(*) AS operations
FROM lakefssystem.main.system.audit_log
WHERE time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY repository
ORDER BY operations DESC;
Schema¶
The Iceberg audit_log table uses the following schema:
| Column | Type | Nullable | Description |
|---|---|---|---|
| user | string | yes | Username of the authenticated user (empty for unauthenticated) |
| repository | string | yes | Repository name (empty if not applicable) |
| ref | string | yes | Branch, tag, or commit reference |
| status_code | int32 | no | HTTP response status code |
| service_name | string | no | rest_api or s3_gateway |
| request_id | string | no | Unique request identifier |
| path | string | yes | HTTP request path |
| operation_id | string | no | Logical operation (e.g., ListRepositories, GetObject) |
| method | string | no | HTTP method (GET, POST, etc.) |
| source_ip | string | yes | Client IP address and port |
| client | string | yes | Client identifier (SDK name/version or User-Agent) |
| time | timestamp (UTC) | no | Request timestamp |
The table is partitioned by days(time) and repository for efficient querying.
Maintenance¶
A periodic maintenance job compacts small files, expires old snapshots, cleans up orphaned files, and commits changes. It can run in two modes:
- In-process (default): a scheduler embedded in the lakeFS server runs maintenance automatically. This is the simplest deployment — enabled by default when audit_log.enabled is true.
- External job: run lakefs audit maintain as a scheduled job in your own orchestration system (e.g. Kubernetes CronJob, cron, Airflow). To use this mode, set audit_log.maintenance.enabled: false in the server configuration to disable the in-process scheduler.
Note
Both modes use identical configuration and emit the same metrics, making it straightforward to switch between them.
In-process mode¶
The in-process scheduler runs inside the lakeFS server on a configurable cron schedule. A distributed lock ensures only one instance runs maintenance at a time across a multi-replica deployment.
audit_log:
  enabled: true
  maintenance:
    enabled: true          # default true; requires audit_log.enabled to be true
    schedule: "0 * * * *"  # cron expression (default: every hour)
In-process maintenance metrics are scraped via the lakeFS server's /metrics endpoint alongside all other lakeFS metrics.
External job¶
Run lakefs audit maintain as a scheduled job, using the same lakeFS Enterprise binary and configuration file as the server.
The job performs compaction, snapshot expiration, orphan cleanup, and commits the changes. Each step can be toggled with flags (--compact, --expire-snapshots, --cleanup-orphans, --commit — all default to true).
Exit codes:
| Code | Meaning |
|---|---|
| 0 | All steps completed successfully |
| 1 | One or more steps failed |
| 2 | Compaction failed |
Use the exit code and the metrics below to monitor maintenance health.
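As an illustration, a small Python wrapper (a sketch, not part of lakeFS) that runs the job and reports based on the exit codes above might look like this. It assumes the lakeFS Enterprise binary and its configuration file are available in the job's environment.

import subprocess
import sys

# A sketch, not part of lakeFS: run the external maintenance job and map its
# exit code to the statuses documented above.
result = subprocess.run(["lakefs", "audit", "maintain", "--retention-days", "90"])

if result.returncode == 0:
    print("audit maintenance: all steps completed successfully")
elif result.returncode == 2:
    print("audit maintenance: compaction failed", file=sys.stderr)
else:
    print("audit maintenance: one or more steps failed", file=sys.stderr)

sys.exit(result.returncode)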
Tuning flags:
| Flag | Default | Description |
|---|---|---|
| --retention-days | 90 | Expire snapshots older than this (0 = no expiration) |
| --compact-min-files | 3 | Minimum small files to trigger compaction |
| --compact-max-small-file-size | 33554432 | Files smaller than this (bytes) are "small" (default 32MB) |
When running as an external job, metrics can be pushed to a Prometheus Pushgateway after each run. Configure the push target via audit_log.maintenance.metrics_push_url in the lakeFS configuration file. If the URL is empty, no push happens.
Helm chart: The lakeFS Helm chart can deploy a Kubernetes CronJob when auditLog.maintenance.cronJob is set to true. It shares the same config file and license secret as the main deployment.
Metrics¶
| Metric | Type | Description |
|---|---|---|
| audit_maintain_step_duration_seconds{step} | histogram | Duration of each maintenance step |
| audit_maintain_step_errors_total{step} | counter | Number of failed maintenance steps |
| audit_maintain_success_total | counter | Number of fully successful maintenance runs |
| audit_compaction_chunks_total | counter | Compaction chunks processed |
| audit_compaction_files_merged_total | counter | Source files merged during compaction |
| audit_compaction_bytes_merged_total | counter | Bytes merged during compaction |
The {step} label takes the values: compaction, snapshot-expiration, orphan-cleanup, lakefs-commit, all-steps.
For configuration details, see the audit_log section in the Enterprise Configuration Reference.
Log-Based Audit (On-Premises)¶
For deployments using Azure Blob Storage or other blockstore types not supported by the Iceberg audit pipeline, audit events can be collected from lakeFS container stdout using a log collector.
Collection¶
Send logs to container stdout (the default) and use a log collector to capture and forward them. Filter for entries where log_audit equals true.
Common log collectors: Fluent Bit, Fluentd, Logstash.
Example audit log entry (JSON):
{
    "client": "lakefs-python-sdk/1.65.2",
    "host": "lakefs.example.com",
    "level": "info",
    "log_audit": true,
    "method": "POST",
    "msg": "HTTP call ended",
    "operation_id": "DeleteObjects",
    "path": "/api/v1/repositories/my-repo/branches/my-branch/objects/delete",
    "request_id": "1234567-5b66-7655-b4e8-2h0c271f6r90",
    "service_name": "rest_api",
    "source_ip": "80.0.0.10:34708",
    "status_code": 200,
    "time": "2025-12-25T12:30:32Z",
    "user": "lakefs-ci-bot"
}
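If you want to prototype the filtering before wiring up a collector, a minimal Python sketch that extracts audit entries from captured stdout (assuming one JSON object per line) could look like this:

import json
import sys

# A sketch, not a production pipeline: keep only audit entries
# (log_audit == true) from lakeFS stdout, one JSON object per line.
# A log collector (Fluent Bit, Fluentd, Logstash) would apply an
# equivalent filter on the log_audit field.
for line in sys.stdin:
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip non-JSON log lines
    if entry.get("log_audit") is True:
        print(json.dumps(entry))

You could pipe container logs through such a script and append the output to a file or an object store location.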
Storage and Querying¶
Once collected, audit logs can be stored and queried using various approaches depending on your infrastructure. The choice of storage and query engine depends on your existing infrastructure, retention requirements, and query patterns.
For example: store the logs in S3 or Azure Blob, index them with a Spark ETL job and query them with Athena or Spark, or store them in Elasticsearch for real-time analysis.
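A minimal PySpark sketch along those lines follows. The bucket, prefix, and partition layout are assumptions for illustration; adjust them to wherever your log collector writes the JSON entries.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A sketch, assuming a collector wrote the JSON audit entries to object
# storage partitioned by date; the path below is illustrative only.
spark = SparkSession.builder.getOrCreate()

logs = spark.read.json("s3a://my-audit-logs/lakefs/year=2025/month=12/day=25/")

(
    logs.filter(F.col("log_audit"))
        .groupBy("user", "operation_id")
        .count()
        .orderBy(F.desc("count"))
        .show(20)
)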
Scaling Considerations
At high scale, lakeFS can generate a significant volume of audit logs. Consider the following:
- Partitioning: Partition logs by time (e.g., year/month/day/hour) to improve query performance and manage storage costs
- Retention policies: Define retention periods and lifecycle rules to archive or delete old logs
Schema Considerations
The audit log schema is stable and evolves additively over time. If you're using a schema-aware query engine (e.g., Athena, BigQuery), consider using a schema discovery mechanism such as AWS Glue Crawler or equivalent to automatically detect and update your table schema as new fields appear.
The fields described in the Schema section above (remove the data_ prefix for on-prem) are expected to be present in all audit log entries.