Monitoring using Prometheus
Example prometheus.yml
lakeFS exposes metrics through the same port used by the lakeFS service, using the standard /metrics
path.
An example prometheus.yml
could look like this:
scrape_configs:
- job_name: lakeFS
scrape_interval: 10s
metrics_path: /metrics
static_configs:
- targets:
- lakefs.example.com:8000
Metrics exposed by lakeFS
By default, Prometheus exports metrics with OS process information like memory and CPU. It also includes Go-specific metrics such as details about GC and a number of goroutines. You can learn about these default metrics in this post.
In addition, lakeFS exposes the following metrics to help monitor your deployment:
Name in Prometheus | Description | Labels |
api_requests_total | lakeFS API requests (counter) | code: http status method: http method |
api_request_duration_seconds | Durations of lakeFS API requests (histogram) | operation: name of API operation code: http status |
gateway_request_duration_seconds | lakeFS S3-compatible endpoint request (histogram) | operation: name of gateway operation code: http status |
s3_operation_duration_seconds | Outgoing S3 operations (histogram) | operation: operation name error: “true” if error, “false” otherwise |
gs_operation_duration_seconds | Outgoing Google Storage operations (histogram) | operation: operation name error: “true” if error, “false” otherwise |
azure_operation_duration_seconds | Outgoing Azure storage operations (histogram) | operation: operation name error: “true” if error, “false” otherwise |
kv_request_duration_seconds | Durations of KV requests(histogram) | operation: name of KV operation type: KV type(dynamodb, postgres, etc) |
dynamo_request_duration_seconds | Time spent doing DynamoDB requests | operation: DynamoDB operation name |
dynamo_consumed_capacity_total | The capacity units consumed by operation | operation: DynamoDB operation name |
dynamo_failures_total | The total number of errors while working for kv store | operation: DynamoDB operation name |
pgxpool_acquire_count | PostgreSQL cumulative count of successful acquires from the pool | db_name default to the kv table name (kv) |
pgxpool_acquire_duration_ns | PostgreSQL total duration of all successful acquires from the pool in nanoseconds | db_name default to the kv table name (kv) |
pgxpool_acquired_conns | PostgreSQL number of currently acquired connections in the pool | db_name default to the kv table name (kv) |
pgxpool_canceled_acquire_count | PostgreSQL cumulative count of acquires from the pool that were canceled by a context | db_name default to the kv table name (kv) |
pgxpool_constructing_conns | PostgreSQL number of conns with construction in progress in the pool | db_name default to the kv table name (kv) |
pgxpool_empty_acquire | PostgreSQL cumulative count of successful acquires from the pool that waited for a resource to be released or constructed because the pool was empty | db_name default to the kv table name (kv) |
pgxpool_idle_conns | PostgreSQL number of currently idle conns in the pool | db_name default to the kv table name (kv) |
pgxpool_max_conns | PostgreSQL maximum size of the pool | db_name default to the kv table name (kv) |
pgxpool_total_conns | PostgreSQL total number of resources currently in the pool | db_name default to the kv table name (kv) |
Example queries
Note: when using Prometheus functions like rate or increase, results are extrapolated and may not be exact.
99th percentile of API request latencies
sum by (operation)(histogram_quantile(0.99, rate(api_request_duration_seconds_bucket[1m])))
50th percentile of S3-compatible API latencies
sum by (operation)(histogram_quantile(0.5, rate(gateway_request_duration_seconds_bucket[1m])))
Number of errors in outgoing S3 requests
sum by (operation) (increase(s3_operation_duration_seconds_count{error="true"}[1m]))
Number of open connections to the database
go_sql_stats_connections_open