Monitoring using Prometheus

Example prometheus.yml

lakeFS exposes metrics through the same port used by the lakeFS service, using the standard /metrics path. An example prometheus.yml could look like this:

scrape_configs:
- job_name: lakeFS
  scrape_interval: 10s
  metrics_path: /metrics
  static_configs:
  - targets:
    - lakefs.example.com:8000
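
Before pointing Prometheus at the endpoint, it can help to check what /metrics returns. The following Python sketch is one possible way to do that: it downloads the text exposition and extracts the metric names. The helper names (fetch_metrics, metric_names), the sample text, and the use of plain HTTP are assumptions made for illustration, not part of lakeFS.

```python
import urllib.request


def fetch_metrics(base_url, timeout=5.0):
    """Download the raw text exposition served at <base_url>/metrics."""
    with urllib.request.urlopen(base_url + "/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(exposition):
    """Return the set of sample names found in a text-format exposition."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The sample name ends at the first "{" (labels) or whitespace (value).
        end = len(line)
        for i, ch in enumerate(line):
            if ch == "{" or ch.isspace():
                end = i
                break
        names.add(line[:end])
    return names


# Made-up exposition text, used here instead of a live request.
SAMPLE = """\
# HELP api_requests_total lakeFS API requests
# TYPE api_requests_total counter
api_requests_total{code="200",method="GET"} 12
go_goroutines 42
"""

print(sorted(metric_names(SAMPLE)))
```

Against a running server, metric_names(fetch_metrics("http://lakefs.example.com:8000")) would list the names actually exposed, assuming the endpoint is reachable over plain HTTP from where the script runs.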

Metrics exposed by lakeFS

By default, the Prometheus client library exports OS process metrics such as memory and CPU usage. It also includes Go-specific metrics, such as garbage collection details and the number of goroutines. You can learn about these default metrics in this post.

In addition, lakeFS exposes the following metrics to help monitor your deployment:

| Name in Prometheus | Description | Labels |
|---|---|---|
| api_requests_total | lakeFS API requests (counter) | code: HTTP status; method: HTTP method |
| api_request_duration_seconds | Durations of lakeFS API requests (histogram) | operation: name of API operation; code: HTTP status |
| gateway_request_duration_seconds | Durations of lakeFS S3-compatible endpoint requests (histogram) | operation: name of gateway operation; code: HTTP status |
| s3_operation_duration_seconds | Outgoing S3 operations (histogram) | operation: operation name; error: "true" if error, "false" otherwise |
| gs_operation_duration_seconds | Outgoing Google Storage operations (histogram) | operation: operation name; error: "true" if error, "false" otherwise |
| azure_operation_duration_seconds | Outgoing Azure storage operations (histogram) | operation: operation name; error: "true" if error, "false" otherwise |
| kv_request_duration_seconds | Durations of KV requests (histogram) | operation: name of KV operation; type: KV type (dynamodb, postgres, etc.) |
| dynamo_request_duration_seconds | Time spent doing DynamoDB requests | operation: DynamoDB operation name |
| dynamo_consumed_capacity_total | Capacity units consumed by operation | operation: DynamoDB operation name |
| dynamo_failures_total | Total number of errors encountered while working with the KV store | operation: DynamoDB operation name |
| pgxpool_acquire_count | PostgreSQL cumulative count of successful acquires from the pool | db_name: defaults to the kv table name (kv) |
| pgxpool_acquire_duration_ns | PostgreSQL total duration of all successful acquires from the pool, in nanoseconds | db_name: defaults to the kv table name (kv) |
| pgxpool_acquired_conns | PostgreSQL number of currently acquired connections in the pool | db_name: defaults to the kv table name (kv) |
| pgxpool_canceled_acquire_count | PostgreSQL cumulative count of acquires from the pool that were canceled by a context | db_name: defaults to the kv table name (kv) |
| pgxpool_constructing_conns | PostgreSQL number of connections with construction in progress in the pool | db_name: defaults to the kv table name (kv) |
| pgxpool_empty_acquire | PostgreSQL cumulative count of successful acquires from the pool that waited for a resource to be released or constructed because the pool was empty | db_name: defaults to the kv table name (kv) |
| pgxpool_idle_conns | PostgreSQL number of currently idle connections in the pool | db_name: defaults to the kv table name (kv) |
| pgxpool_max_conns | PostgreSQL maximum size of the pool | db_name: defaults to the kv table name (kv) |
| pgxpool_total_conns | PostgreSQL total number of resources currently in the pool | db_name: defaults to the kv table name (kv) |

Example queries

Note: when using Prometheus functions like rate or increase, results are extrapolated and may not be exact.
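
To illustrate why the results may not be exact, here is a simplified Python model of how increase extrapolates from the samples inside a range window. It is only a sketch: it ignores counter resets and the extra clamping Prometheus applies for counters close to zero, and the function name and sample values are made up for the example.

```python
def extrapolated_increase(samples, window_start, window_end):
    """Approximate increase() over [window_start, window_end].

    samples: list of (timestamp_seconds, counter_value) pairs in ascending
    time order, all inside the window. Counter resets are not handled.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a delta

    t_first, v_first = samples[0]
    t_last, v_last = samples[-1]
    raw_delta = v_last - v_first
    sampled_interval = t_last - t_first
    avg_gap = sampled_interval / (len(samples) - 1)

    # Gaps between the window edges and the first/last sample.
    to_start = t_first - window_start
    to_end = window_end - t_last

    # If a gap is much larger than the usual sample spacing, only
    # extrapolate by half a sample interval on that side.
    threshold = avg_gap * 1.1
    if to_start >= threshold:
        to_start = avg_gap / 2
    if to_end >= threshold:
        to_end = avg_gap / 2

    factor = (sampled_interval + to_start + to_end) / sampled_interval
    return raw_delta * factor


# A counter scraped every 10s that grows by 1 per scrape, queried over [1m].
samples = [(5, 100), (15, 101), (25, 102), (35, 103), (45, 104), (55, 105)]
print(extrapolated_increase(samples, 0, 60))
```

In this example the counter grew by 5 between the first and last sample, but the samples only span 50 of the 60 seconds, so the model scales the result up to 6. This is how a query over an integer counter can return a value that never actually occurred.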

99th percentile of API request latencies

histogram_quantile(0.99, sum by (le, operation) (rate(api_request_duration_seconds_bucket[1m])))

50th percentile of S3-compatible API latencies

histogram_quantile(0.5, sum by (le, operation) (rate(gateway_request_duration_seconds_bucket[1m])))
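
The percentile queries above rely on histogram_quantile, which estimates a quantile from cumulative bucket counters rather than from individual request timings. The Python sketch below mimics that estimate with linear interpolation inside the bucket that contains the target rank. It assumes positive bucket bounds (as with latencies), and the bucket boundaries and counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the +Inf bucket. Bounds are assumed positive.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations

    # Rank of the observation we are looking for.
    rank = q * total

    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if upper == math.inf:
        # The quantile falls beyond the last finite bound, so the best
        # available answer is that bound itself.
        return buckets[-2][0] if len(buckets) > 1 else math.nan

    # The lowest bucket is assumed to start at 0.
    lower, prev_count = buckets[index - 1] if index > 0 else (0.0, 0)

    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Invented latency buckets in seconds: (le, cumulative count).
BUCKETS = [(0.1, 20), (0.5, 80), (1.0, 95), (math.inf, 100)]

print(round(histogram_quantile(0.5, BUCKETS), 3))
print(histogram_quantile(0.99, BUCKETS))
```

With these buckets the median lands at roughly 0.3 seconds, while the 99th percentile falls in the +Inf bucket and is reported as the highest finite bound. The estimate is therefore only as precise as the bucket layout allows.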

Number of errors in outgoing S3 requests

sum by (operation) (increase(s3_operation_duration_seconds_count{error="true"}[1m]))

Number of open connections to the database

go_sql_stats_connections_open

Example Grafana dashboard

Grafana dashboard example